Re: [PATCH v2 1/1] drm/amdgpu: Fix MMIO HDP flush on SRIOV

2021-11-17 Thread Lazar, Lijo




On 11/18/2021 5:52 AM, Felix Kuehling wrote:

On 2021-11-10 11:34 a.m., Felix Kuehling wrote:

Am 2021-11-10 um 11:11 a.m. schrieb Lazar, Lijo:

[Public]



(... && !amdgpu_sriov_vf(adev))

This kind of closes the door for all versions. My thought was - having
it in the same function provides a logical grouping for how it's
handled for different cases - VF vs non-VF - for a particular IP 
version.

Except that's not really true. There is still common code (setting
adev->rmmio_remap.bus_addr) for handling MMIO remapping on VFs in
soc15.c and nv.c. I'm not comfortable with duplicating all of that
across all the IP version-specific files.

I also think it's very unlikely that a small IP version bump is going to
change this logic. So I'd rather prefer to keep the code more general
and conservatively correct.


Hi Lijo,

The virtualization team has finished testing this patch and wants me to 
submit it. Do you insist I rework the patch to move all the 
adev->rmmio_remap fixups for virtualization into the nbio 
version-specific remap_hdp_registers functions?




Not required, it's fine -

Reviewed-by: Lijo Lazar 

Thanks,
Lijo


Regards,
   Felix




Regards,
   Felix



Thanks,
Lijo

*From:* Kuehling, Felix 
*Sent:* Wednesday, November 10, 2021 9:27:22 PM
*To:* Lazar, Lijo ; amd-gfx@lists.freedesktop.org

*Cc:* Zhang, Bokun 
*Subject:* Re: [PATCH v2 1/1] drm/amdgpu: Fix MMIO HDP flush on SRIOV
Am 2021-11-10 um 4:14 a.m. schrieb Lazar, Lijo:

On 11/10/2021 8:00 AM, Felix Kuehling wrote:

Disable HDP register remapping on SRIOV and set rmmio_remap.reg_offset
to the fixed address of the VF register for hdp_v*_flush_hdp.

Signed-off-by: Felix Kuehling 
---
   drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c | 4 ++++
   drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c | 4 ++++
   drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c | 4 +++-
   drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c | 4 ++++
   drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 4 +++-
   drivers/gpu/drm/amd/amdgpu/nv.c    | 8 +++++---
   drivers/gpu/drm/amd/amdgpu/soc15.c | 8 +++++---
   7 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c
b/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c
index 4ecd2b5808ce..ee7cab37dfd5 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c
@@ -359,6 +359,10 @@ static void nbio_v2_3_init_registers(struct
amdgpu_device *adev)
     if (def != data)
   WREG32_PCIE(smnPCIE_CONFIG_CNTL, data);
+
+    if (amdgpu_sriov_vf(adev))
+    adev->rmmio_remap.reg_offset = SOC15_REG_OFFSET(NBIO, 0,
+    mmBIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL) << 2;

Wouldn't it be better to do this assignment inside
remap_hdp_registers()? Return with a comment saying no remap is done
for VFs. That looks easier to manage per IP version.

I was considering that. I felt it was clearer not to have that hidden
side effect in remap_hdp_registers and to have the explicit condition
(... &&  !amdgpu_sriov_vf(adev)) around the remap_hdp_registers call in
soc15/nv_common_hw_init.

Regards,
   Felix



Thanks,
Lijo


   }
     #define NAVI10_PCIE__LC_L0S_INACTIVITY_DEFAULT    0x
// off by default, no gains over L1
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c
b/drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c
index 0d2d629e2d6a..4bbacf1be25a 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c
@@ -276,6 +276,10 @@ static void nbio_v6_1_init_registers(struct
amdgpu_device *adev)
     if (def != data)
   WREG32_PCIE(smnPCIE_CI_CNTL, data);
+
+    if (amdgpu_sriov_vf(adev))
+    adev->rmmio_remap.reg_offset = SOC15_REG_OFFSET(NBIO, 0,
+    mmBIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL) << 2;

   }
     static void nbio_v6_1_program_ltr(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c
b/drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c
index 3c00666a13e1..37a4039fdfc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c
@@ -273,7 +273,9 @@ const struct nbio_hdp_flush_reg
nbio_v7_0_hdp_flush_reg = {
     static void nbio_v7_0_init_registers(struct amdgpu_device *adev)
   {
-
+    if (amdgpu_sriov_vf(adev))
+    adev->rmmio_remap.reg_offset =
+    SOC15_REG_OFFSET(NBIO, 0, mmHDP_MEM_COHERENCY_FLUSH_CNTL) << 2;
   }
     const struct amdgpu_nbio_funcs nbio_v7_0_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c
b/drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c
index 8f2a315e7c73..3444332ea110 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c
@@ -371,6 +371,10 @@ static void nbio_v7_2_init_registers(struct
amdgpu_device *adev)
   if (def != data)
   WREG32_PCIE_PORT(SOC15_REG_OFFSET(NBIO, 0,
regPCIE_CONFIG_CNTL), data);
   }
+
+    if (amdgpu_sriov_vf(adev))
+    adev->rmmio_rem
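
The hunk above is cut off by the archive. For reference, here is a minimal
sketch of the alternative Lijo suggests, folding the VF handling into each
nbio version's remap_hdp_registers() callback. The non-VF body follows the
existing nbio_v2_3 implementation; treat the exact register and macro names
as assumptions carried over from the hunks quoted above. The posted patch
keeps the VF check at the soc15/nv_common_hw_init call site instead.

/* Sketch only -- not the posted patch */
static void nbio_v2_3_remap_hdp_registers(struct amdgpu_device *adev)
{
	if (amdgpu_sriov_vf(adev)) {
		/* No remapping on VFs: point rmmio_remap at the fixed
		 * VF copy of the HDP flush register and return.
		 */
		adev->rmmio_remap.reg_offset = SOC15_REG_OFFSET(NBIO, 0,
			mmBIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL) << 2;
		return;
	}
	WREG32_SOC15(NBIO, 0, mmREMAP_HDP_MEM_FLUSH_CNTL,
		     adev->rmmio_remap.reg_offset + KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL);
	WREG32_SOC15(NBIO, 0, mmREMAP_HDP_REG_FLUSH_CNTL,
		     adev->rmmio_remap.reg_offset + KFD_MMIO_REMAP_HDP_REG_FLUSH_CNTL);
}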

[PATCH] drm/amdgpu: reset asic after system-wide suspend aborted

2021-11-17 Thread Prike Liang
Do an ASIC reset when an Sx suspend is aborted after amdgpu has already
suspended, to keep the AMDGPU device in a clean reset state and avoid
errors from re-initializing an improperly reset device.

Signed-off-by: Prike Liang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 19 +++++++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b85b67a..8bd9833 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1053,6 +1053,7 @@ struct amdgpu_device {
boolin_s3;
boolin_s4;
boolin_s0ix;
+   boolpm_completed;
 
atomic_tin_gpu_reset;
enum pp_mp1_state   mp1_state;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ec42a6f..a12ed54 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
if (adev->in_s0ix)
amdgpu_gfx_state_change_set(adev, sGpuChangeState_D0Entry);
 
+   if (!adev->pm_completed) {
+   dev_warn(adev->dev, "suspend aborted will do asic reset\n");
+   amdgpu_asic_reset(adev);
+   }
/* post card */
if (amdgpu_device_need_post(adev)) {
r = amdgpu_device_asic_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index eee3cf8..9f90017 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2168,6 +2168,23 @@ static int amdgpu_pmops_suspend(struct device *dev)
return r;
 }
 
+/*
+ * Strictly speaking, whether the PM suspend completed should be confirmed
+ * by checking the sysfs file /sys/power/suspend_stats/failed_suspend. However,
+ * this function only checks whether the AMDGPU device itself was suspended
+ * completely during the system-wide suspend process.
+ */
+static int amdgpu_pmops_noirq_suspend(struct device *dev)
+{
+   struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+
+   dev_dbg(dev, "amdgpu suspend completely.\n");
+   adev->pm_completed = true;
+
+   return 0;
+}
+
 static int amdgpu_pmops_resume(struct device *dev)
 {
struct drm_device *drm_dev = dev_get_drvdata(dev);
@@ -2181,6 +2198,7 @@ static int amdgpu_pmops_resume(struct device *dev)
r = amdgpu_device_resume(drm_dev, true);
if (amdgpu_acpi_is_s0ix_active(adev))
adev->in_s0ix = false;
+   adev->pm_completed = false;
return r;
 }
 
@@ -2397,6 +2415,7 @@ static const struct dev_pm_ops amdgpu_pm_ops = {
.runtime_suspend = amdgpu_pmops_runtime_suspend,
.runtime_resume = amdgpu_pmops_runtime_resume,
.runtime_idle = amdgpu_pmops_runtime_idle,
+   .suspend_noirq = amdgpu_pmops_noirq_suspend,
 };
 
 static int amdgpu_flush(struct file *f, fl_owner_t id)
-- 
2.7.4
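
A note on why a noirq-phase hook detects an aborted suspend: the PM core
only runs ->suspend_noirq() after every device's ->suspend() callback has
succeeded, so pm_completed is still false on the resume path whenever any
driver aborted the earlier phase. Condensed flow (paraphrased for
illustration, not kernel source):

/*
 * Suspend:  amdgpu_pmops_suspend()          device quiesced; pm_completed == false
 *           every ->suspend() succeeded?
 *             yes: amdgpu_pmops_noirq_suspend() sets adev->pm_completed = true
 *             no:  suspend aborts; the noirq phase never runs
 * Resume:   amdgpu_pmops_resume()
 *             -> amdgpu_device_resume():
 *                  if (!adev->pm_completed) amdgpu_asic_reset(adev);
 *             -> adev->pm_completed = false;  (re-armed for the next cycle)
 */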



Re: [PATCH v2] drm/amdkfd: Retrieve SDMA numbers from amdgpu

2021-11-17 Thread Felix Kuehling
Am 2021-11-18 um 12:39 a.m. schrieb Amber Lin:
> Instead of hard coding the number of sdma engines and the number of
> sdma_xgmi engines in the device_info table, get the number of total SDMA
> instances from amdgpu. The first two engines are sdma engines and the
> rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.
>
> v2: add kfd_ prefix to non static function names
>
> Signed-off-by: Amber Lin 

Reviewed-by: Felix Kuehling 


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c   | 20 
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 32 +++
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  3 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c |  4 +--
>  4 files changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index ce9f4e562bac..3fea47e37c17 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -1516,6 +1516,26 @@ void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, 
> uint64_t throttle_bitmask)
>   kfd_smi_event_update_thermal_throttling(kfd, throttle_bitmask);
>  }
>  
> +/* kfd_get_num_sdma_engines returns the number of PCIe optimized SDMA and
> + * kfd_get_num_xgmi_sdma_engines returns the number of XGMI SDMA.
> + * When the device has more than two engines, we reserve two for PCIe to 
> enable
> + * full-duplex and the rest are used as XGMI.
> + */
> +unsigned int kfd_get_num_sdma_engines(struct kfd_dev *kdev)
> +{
> + /* If XGMI is not supported, all SDMA engines are PCIe */
> + if (!kdev->adev->gmc.xgmi.supported)
> + return kdev->adev->sdma.num_instances;
> +
> + return min(kdev->adev->sdma.num_instances, 2);
> +}
> +
> +unsigned int kfd_get_num_xgmi_sdma_engines(struct kfd_dev *kdev)
> +{
> + /* After reserved for PCIe, the rest of engines are XGMI */
> + return kdev->adev->sdma.num_instances - kfd_get_num_sdma_engines(kdev);
> +}
> +
>  #if defined(CONFIG_DEBUG_FS)
>  
>  /* This function will send a package to HIQ to hang the HWS
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 62fe28244a80..2af2b3268171 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -99,31 +99,22 @@ unsigned int get_pipes_per_mec(struct 
> device_queue_manager *dqm)
>   return dqm->dev->shared_resources.num_pipe_per_mec;
>  }
>  
> -static unsigned int get_num_sdma_engines(struct device_queue_manager *dqm)
> -{
> - return dqm->dev->device_info->num_sdma_engines;
> -}
> -
> -static unsigned int get_num_xgmi_sdma_engines(struct device_queue_manager 
> *dqm)
> -{
> - return dqm->dev->device_info->num_xgmi_sdma_engines;
> -}
> -
>  static unsigned int get_num_all_sdma_engines(struct device_queue_manager 
> *dqm)
>  {
> - return get_num_sdma_engines(dqm) + get_num_xgmi_sdma_engines(dqm);
> + return kfd_get_num_sdma_engines(dqm->dev) +
> + kfd_get_num_xgmi_sdma_engines(dqm->dev);
>  }
>  
>  unsigned int get_num_sdma_queues(struct device_queue_manager *dqm)
>  {
> - return dqm->dev->device_info->num_sdma_engines
> - * dqm->dev->device_info->num_sdma_queues_per_engine;
> + return kfd_get_num_sdma_engines(dqm->dev) *
> + dqm->dev->device_info->num_sdma_queues_per_engine;
>  }
>  
>  unsigned int get_num_xgmi_sdma_queues(struct device_queue_manager *dqm)
>  {
> - return dqm->dev->device_info->num_xgmi_sdma_engines
> - * dqm->dev->device_info->num_sdma_queues_per_engine;
> + return kfd_get_num_xgmi_sdma_engines(dqm->dev) *
> + dqm->dev->device_info->num_sdma_queues_per_engine;
>  }
>  
>  void program_sh_mem_settings(struct device_queue_manager *dqm,
> @@ -1054,9 +1045,9 @@ static int allocate_sdma_queue(struct 
> device_queue_manager *dqm,
>   dqm->sdma_bitmap &= ~(1ULL << bit);
>   q->sdma_id = bit;
>   q->properties.sdma_engine_id = q->sdma_id %
> - get_num_sdma_engines(dqm);
> + kfd_get_num_sdma_engines(dqm->dev);
>   q->properties.sdma_queue_id = q->sdma_id /
> - get_num_sdma_engines(dqm);
> + kfd_get_num_sdma_engines(dqm->dev);
>   } else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI) {
>   if (dqm->xgmi_sdma_bitmap == 0) {
>   pr_err("No more XGMI SDMA queue to allocate\n");
> @@ -1071,10 +1062,11 @@ static int allocate_sdma_queue(struct 
> device_queue_manager *dqm,
>* assumes the first N engines are always
>* PCIe-optimized ones
>*/
> - q->properties.sdma_engine_id = get_num_sdma_engines(dqm) +
> - q->sdma_id % get_num_xgmi_sdma_engines(dqm);

[PATCH v2] drm/amdkfd: Retrieve SDMA numbers from amdgpu

2021-11-17 Thread Amber Lin
Instead of hard coding the number of sdma engines and the number of
sdma_xgmi engines in the device_info table, get the number of total SDMA
instances from amdgpu. The first two engines are sdma engines and the
rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.

v2: add kfd_ prefix to non static function names

Signed-off-by: Amber Lin 
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   | 20 
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 32 +++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  3 ++
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c |  4 +--
 4 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index ce9f4e562bac..3fea47e37c17 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1516,6 +1516,26 @@ void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, 
uint64_t throttle_bitmask)
kfd_smi_event_update_thermal_throttling(kfd, throttle_bitmask);
 }
 
+/* kfd_get_num_sdma_engines returns the number of PCIe optimized SDMA and
+ * kfd_get_num_xgmi_sdma_engines returns the number of XGMI SDMA.
+ * When the device has more than two engines, we reserve two for PCIe to enable
+ * full-duplex and the rest are used as XGMI.
+ */
+unsigned int kfd_get_num_sdma_engines(struct kfd_dev *kdev)
+{
+   /* If XGMI is not supported, all SDMA engines are PCIe */
+   if (!kdev->adev->gmc.xgmi.supported)
+   return kdev->adev->sdma.num_instances;
+
+   return min(kdev->adev->sdma.num_instances, 2);
+}
+
+unsigned int kfd_get_num_xgmi_sdma_engines(struct kfd_dev *kdev)
+{
+   /* After reserved for PCIe, the rest of engines are XGMI */
+   return kdev->adev->sdma.num_instances - kfd_get_num_sdma_engines(kdev);
+}
+
 #if defined(CONFIG_DEBUG_FS)
 
 /* This function will send a package to HIQ to hang the HWS
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 62fe28244a80..2af2b3268171 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -99,31 +99,22 @@ unsigned int get_pipes_per_mec(struct device_queue_manager 
*dqm)
return dqm->dev->shared_resources.num_pipe_per_mec;
 }
 
-static unsigned int get_num_sdma_engines(struct device_queue_manager *dqm)
-{
-   return dqm->dev->device_info->num_sdma_engines;
-}
-
-static unsigned int get_num_xgmi_sdma_engines(struct device_queue_manager *dqm)
-{
-   return dqm->dev->device_info->num_xgmi_sdma_engines;
-}
-
 static unsigned int get_num_all_sdma_engines(struct device_queue_manager *dqm)
 {
-   return get_num_sdma_engines(dqm) + get_num_xgmi_sdma_engines(dqm);
+   return kfd_get_num_sdma_engines(dqm->dev) +
+   kfd_get_num_xgmi_sdma_engines(dqm->dev);
 }
 
 unsigned int get_num_sdma_queues(struct device_queue_manager *dqm)
 {
-   return dqm->dev->device_info->num_sdma_engines
-   * dqm->dev->device_info->num_sdma_queues_per_engine;
+   return kfd_get_num_sdma_engines(dqm->dev) *
+   dqm->dev->device_info->num_sdma_queues_per_engine;
 }
 
 unsigned int get_num_xgmi_sdma_queues(struct device_queue_manager *dqm)
 {
-   return dqm->dev->device_info->num_xgmi_sdma_engines
-   * dqm->dev->device_info->num_sdma_queues_per_engine;
+   return kfd_get_num_xgmi_sdma_engines(dqm->dev) *
+   dqm->dev->device_info->num_sdma_queues_per_engine;
 }
 
 void program_sh_mem_settings(struct device_queue_manager *dqm,
@@ -1054,9 +1045,9 @@ static int allocate_sdma_queue(struct 
device_queue_manager *dqm,
dqm->sdma_bitmap &= ~(1ULL << bit);
q->sdma_id = bit;
q->properties.sdma_engine_id = q->sdma_id %
-   get_num_sdma_engines(dqm);
+   kfd_get_num_sdma_engines(dqm->dev);
q->properties.sdma_queue_id = q->sdma_id /
-   get_num_sdma_engines(dqm);
+   kfd_get_num_sdma_engines(dqm->dev);
} else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI) {
if (dqm->xgmi_sdma_bitmap == 0) {
pr_err("No more XGMI SDMA queue to allocate\n");
@@ -1071,10 +1062,11 @@ static int allocate_sdma_queue(struct 
device_queue_manager *dqm,
 * assumes the first N engines are always
 * PCIe-optimized ones
 */
-   q->properties.sdma_engine_id = get_num_sdma_engines(dqm) +
-   q->sdma_id % get_num_xgmi_sdma_engines(dqm);
+   q->properties.sdma_engine_id =
+   kfd_get_num_sdma_engines(dqm->dev) +
+   q->sdma_id % kfd_get_num_xgmi_sdma_engines(dqm->dev);
q->properties
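
The quoted hunk is truncated above. A worked example of the partitioning
these helpers implement (the device below is hypothetical, not taken from
the patch): on an XGMI-capable ASIC with sdma.num_instances = 5,
kfd_get_num_sdma_engines() returns min(5, 2) = 2 and
kfd_get_num_xgmi_sdma_engines() returns 5 - 2 = 3. The queue-to-engine
mapping in allocate_sdma_queue() then works out as follows:

/* Illustrative only: mirrors the helpers quoted above. */
unsigned int num_inst = 5;        /* adev->sdma.num_instances */
unsigned int pcie_engines = 2;    /* min(num_inst, 2), XGMI supported */
unsigned int xgmi_engines = 3;    /* num_inst - pcie_engines */

/* PCIe-optimized queue, sdma_id = 7:
 *   sdma_engine_id = 7 % pcie_engines = 1
 *   sdma_queue_id  = 7 / pcie_engines = 3
 * XGMI queue, sdma_id = 4:
 *   sdma_engine_id = pcie_engines + 4 % xgmi_engines = 2 + 1 = 3
 *   sdma_queue_id  = 4 / xgmi_engines = 1
 */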

Re: Re: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Lazar, Lijo




On 11/18/2021 9:56 AM, Yang, Stanley wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Thursday, November 18, 2021 12:04 PM
To: Yang, Stanley ; amd-
g...@lists.freedesktop.org; Zhang, Hawking ;
Clements, John ; Quan, Evan
; Wang, Yang(Kevin) 
Subject: Re: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
ecc_table



On 11/18/2021 9:07 AM, Yang, Stanley wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Wednesday, November 17, 2021 7:24 PM
To: Yang, Stanley ; amd-
g...@lists.freedesktop.org; Zhang, Hawking ;
Clements, John ; Quan, Evan
; Wang, Yang(Kevin)

Subject: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
ecc_table



On 11/17/2021 3:41 PM, Stanley.Yang wrote:

support the ECC TABLE message; this table includes unc ras error count
and error address

Signed-off-by: Stanley.Yang 
---
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 
.../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38

+++

.../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 24 
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
5 files changed, 74 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 3557f4e7fc30..ea65de0160c3 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -324,6 +324,7 @@ enum smu_table_id
SMU_TABLE_OVERDRIVE,
SMU_TABLE_I2C_COMMANDS,
SMU_TABLE_PACE,
+   SMU_TABLE_ECCINFO,
SMU_TABLE_COUNT,
};

@@ -340,6 +341,7 @@ struct smu_table_context
void*max_sustainable_clocks;
struct smu_bios_boot_up_values  boot_values;
void*driver_pptable;
+   void*ecc_table;
struct smu_tabletables[SMU_TABLE_COUNT];
/*
 * The driver table is just a staging buffer for
@@ -1261,6 +1263,11 @@ struct pptable_funcs {
 * of SMUBUS table.
 */
	int (*send_hbm_bad_pages_num)(struct smu_context *smu, uint32_t size);
+
+   /**
+* @get_ecc_table:  message SMU to get ECC INFO table.
+*/
+   ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
};

typedef enum {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index f835d86cc2f5..5e4ba0e14a91 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -190,6 +190,7 @@ static const struct cmn2asic_mapping aldebaran_table_map[SMU_TABLE_COUNT] = {
TAB_MAP(SMU_METRICS),
TAB_MAP(DRIVER_SMU_CONFIG),
TAB_MAP(I2C_COMMANDS),
+   TAB_MAP(ECCINFO),
};

static const uint8_t aldebaran_throttler_map[] = {
@@ -223,6 +224,9 @@ static int aldebaran_tables_init(struct smu_context *smu)
	SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS, sizeof(SwI2cRequest_t),
		       PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);

+	SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO, sizeof(EccInfoTable_t),
+		       PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
+
	smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t), GFP_KERNEL);
	if (!smu_table->metrics_table)
return -ENOMEM;
@@ -235,6 +239,10 @@ static int aldebaran_tables_init(struct smu_context *smu)
		return -ENOMEM;
	}

+	smu_table->ecc_table = kzalloc(tables[SMU_TABLE_ECCINFO].size, GFP_KERNEL);
+	if (!smu_table->ecc_table)
+		return -ENOMEM;
+
return 0;
}

@@ -1765,6 +1773,35 @@ static ssize_t aldebaran_get_gpu_metrics(struct smu_context *smu,
	return sizeof(struct gpu_metrics_v1_3);
}

+static ssize_t aldebaran_get_ecc_info(struct smu_context *smu,
+void *table)
+{
+   struct smu_table_context *smu_table = &smu->smu_table;
+   EccInfoTable_t ecc_table;
+   struct ecc_info_per_ch *ecc_info_per_channel = NULL;
+   int i, ret = 0;
+   struct umc_ecc_info *eccinfo = (struct umc_ecc_info *)table;
+
+   ret = smu_cmn_get_ecc_info_table(smu,
+   &ecc_table);
+   if (ret)
+   return ret;
+
+   for (i = 0; i < ALDEBARAN_UMC_CHANNEL_NUM; i++) {
+   ecc_info_per_channel = &(eccinfo->ecc[i]);
+   ecc_info_per_channel->ce_count_lo_chip =
+   ecc_table.EccInfo[i].ce_count_lo_chip;
+   ecc_info_per_channel->ce_count_hi_chip =
+   ecc_table.EccInfo[i].ce_count_hi_chip;
+   ecc_info_per_channel->mca_umc_status =
+   ecc_table.EccInfo[i].mca_umc_status;
+   ecc_info_per_channel->mca_umc_addr =
+   ecc_table
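
The smu_cmn_get_ecc_info_table() helper called above is cut off in this
archive. A hedged sketch of what such a helper typically looks like,
assuming the patch follows the usual smu_cmn staging-table pattern (the
actual body is not shown in this thread):

int smu_cmn_get_ecc_info_table(struct smu_context *smu, void *ecc_table)
{
	struct smu_table_context *smu_table = &smu->smu_table;
	int ret;

	/* Ask the PMFW to fill the staging copy of the ECC info table */
	ret = smu_cmn_update_table(smu, SMU_TABLE_ECCINFO, 0,
				   smu_table->ecc_table, false);
	if (ret) {
		dev_info(smu->adev->dev, "Failed to export SMU ecc table!\n");
		return ret;
	}

	/* Copy out to the caller-provided EccInfoTable_t buffer */
	memcpy(ecc_table, smu_table->ecc_table, sizeof(EccInfoTable_t));
	return 0;
}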

Re: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, November 18, 2021 12:04 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
> ecc_table
> 
> 
> 
> On 11/18/2021 9:07 AM, Yang, Stanley wrote:
> > [AMD Official Use Only]
> >
> >
> >
> >> -Original Message-
> >> From: Lazar, Lijo 
> >> Sent: Wednesday, November 17, 2021 7:24 PM
> >> To: Yang, Stanley ; amd-
> >> g...@lists.freedesktop.org; Zhang, Hawking ;
> >> Clements, John ; Quan, Evan
> >> ; Wang, Yang(Kevin)
> 
> >> Subject: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
> >> ecc_table
> >>
> >>
> >>
> >> On 11/17/2021 3:41 PM, Stanley.Yang wrote:
> >>> support the ECC TABLE message; this table includes unc ras error count
> >>> and error address
> >>>
> >>> Signed-off-by: Stanley.Yang 
> >>> ---
> >>>drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 
> >>>.../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38
> >> +++
> >>>.../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
> >>>drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 24 
> >>>drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
> >>>5 files changed, 74 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>> index 3557f4e7fc30..ea65de0160c3 100644
> >>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>> @@ -324,6 +324,7 @@ enum smu_table_id
> >>>   SMU_TABLE_OVERDRIVE,
> >>>   SMU_TABLE_I2C_COMMANDS,
> >>>   SMU_TABLE_PACE,
> >>> + SMU_TABLE_ECCINFO,
> >>>   SMU_TABLE_COUNT,
> >>>};
> >>>
> >>> @@ -340,6 +341,7 @@ struct smu_table_context
> >>>   void*max_sustainable_clocks;
> >>>   struct smu_bios_boot_up_values  boot_values;
> >>>   void*driver_pptable;
> >>> + void*ecc_table;
> >>>   struct smu_tabletables[SMU_TABLE_COUNT];
> >>>   /*
> >>>* The driver table is just a staging buffer for
> >>> @@ -1261,6 +1263,11 @@ struct pptable_funcs {
> >>>* of SMUBUS table.
> >>>*/
> >>>   int (*send_hbm_bad_pages_num)(struct smu_context *smu, uint32_t size);
> >>> +
> >>> + /**
> >>> +  * @get_ecc_table:  message SMU to get ECC INFO table.
> >>> +  */
> >>> + ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
> >>>};
> >>>
> >>>typedef enum {
> >>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>> index f835d86cc2f5..5e4ba0e14a91 100644
> >>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>> @@ -190,6 +190,7 @@ static const struct cmn2asic_mapping
> >> aldebaran_table_map[SMU_TABLE_COUNT] = {
> >>>   TAB_MAP(SMU_METRICS),
> >>>   TAB_MAP(DRIVER_SMU_CONFIG),
> >>>   TAB_MAP(I2C_COMMANDS),
> >>> + TAB_MAP(ECCINFO),
> >>>};
> >>>
> >>>static const uint8_t aldebaran_throttler_map[] = { @@ -223,6
> >>> +224,9 @@ static int aldebaran_tables_init(struct smu_context *smu)
> >>>   SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS,
> >> sizeof(SwI2cRequest_t),
> >>>  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> >>>
> >>> + SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO,
> >> sizeof(EccInfoTable_t),
> >>> +PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> >>> +
> >>>   smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t),
> >> GFP_KERNEL);
> >>>   if (!smu_table->metrics_table)
> >>>   return -ENOMEM;
> >>> @@ -235,6 +239,10 @@ static int aldebaran_tables_init(struct
> >>> smu_context
> >> *smu)
> >>>   return -ENOMEM;
> >>>   }
> >>>
> >>> + smu_table->ecc_table = kzalloc(tables[SMU_TABLE_ECCINFO].size,
> >> GFP_KERNEL);
> >>> + if (!smu_table->ecc_table)
> >>> + return -ENOMEM;
> >>> +
> >>>   return 0;
> >>>}
> >>>
> >>> @@ -1765,6 +1773,35 @@ static ssize_t
> >>> aldebaran_get_gpu_metrics(struct
> >> smu_context *smu,
> >>>   return sizeof(struct gpu_metrics_v1_3);
> >>>}
> >>>
> >>> +static ssize_t aldebaran_get_ecc_info(struct smu_context *smu,
> >>> +  void *table)
> >>> +{
> >>> + struct smu_table_context *smu_table = &smu->smu_table;
> >>> + EccInfoTable_t ecc_table;
> >>> + struct ecc_info_per_ch *ecc_info_per_channel = NULL;
> >>> + int i, ret = 0;
> >>> + struct umc_ecc_info *eccinfo = (struct umc_ecc_info *)table;
> >>> +
> >>> + ret = smu_cmn_get_ecc_info_table(smu,
> >>> + &ecc_table);
> >>> + if (ret)
> >>> + return ret

[pull] amdgpu, amdkfd drm-fixes-5.16

2021-11-17 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 5.16.

The following changes since commit fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf:

  Linux 5.16-rc1 (2021-11-14 13:56:52 -0800)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-5.16-2021-11-17

for you to fetch changes up to 27dfaedc0d321b4ea4e10c53e4679d6911ab17aa:

  drm/amd/amdgpu: fix potential memleak (2021-11-17 23:04:57 -0500)


amd-drm-fixes-5.16-2021-11-17:

amdgpu:
- Better debugging info for SMU msgs
- Better error reporting when adding IP blocks
- Fix UVD powergating regression on CZ
- Clock reporting fix for navi1x
- OLED panel backlight fix
- Fix scaling on VGA/DVI for non-DC display code
- Fix GFXCLK handling for RGP on some APUs
- fix potential memory leak

amdkfd:
- GPU reset fix


Bernard Zhao (1):
  drm/amd/amdgpu: fix potential memleak

Evan Quan (1):
  drm/amd/pm: avoid duplicate powergate/ungate setting

Guchun Chen (1):
  drm/amdgpu: add error print when failing to add IP block(v2)

Lijo Lazar (1):
  drm/amd/pm: Remove artificial freq level on Navi1x

Luben Tuikov (1):
  drm/amd/pm: Enhanced reporting also for a stuck command

Perry Yuan (1):
  drm/amd/pm: add GFXCLK/SCLK clocks level print support for APUs

Roman Li (1):
  drm/amd/display: Fix OLED brightness control on eDP

hongao (1):
  drm/amdgpu: fix set scaling mode Full/Full aspect/Center not works on vga 
and dvi connectors

shaoyunl (1):
  drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered 
again

 drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c  | 36 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  1 +
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  |  5 +++
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c  |  3 +-
 drivers/gpu/drm/amd/include/amd_shared.h   |  3 +-
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c| 10 ++
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h|  8 +
 .../drm/amd/pm/swsmu/smu11/cyan_skillfish_ppt.c| 22 +++--
 drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c| 13 +---
 drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c   | 26 
 .../gpu/drm/amd/pm/swsmu/smu13/yellow_carp_ppt.c   | 27 
 .../gpu/drm/amd/pm/swsmu/smu13/yellow_carp_ppt.h   |  1 +
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c |  8 +++--
 15 files changed, 156 insertions(+), 11 deletions(-)


Re: Re: [PATCH Review 4/4] query umc error info from ecc_table

2021-11-17 Thread Lazar, Lijo




On 11/18/2021 9:29 AM, Yang, Stanley wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Wednesday, November 17, 2021 7:15 PM
To: Yang, Stanley ; amd-
g...@lists.freedesktop.org; Zhang, Hawking ;
Clements, John ; Quan, Evan
; Wang, Yang(Kevin) 
Subject: Re: [PATCH Review 4/4] query umc error info from ecc_table



On 11/17/2021 3:41 PM, Stanley.Yang wrote:

if the SMU supports ECCTABLE, the driver can message the SMU to get the
ecc_table and then query UMC error info from ECCTABLE; apply a PMFW
version check to ensure backward compatibility

Signed-off-by: Stanley.Yang 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 42 ---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  7 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   | 71 +--



   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  1 +
   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 12 
   .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  4 ++
   6 files changed, 107 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 90f0db3b4f65..6b0f2ba1e420 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -888,6 +888,38 @@ void amdgpu_ras_mca_query_error_status(struct amdgpu_device *adev,
	}
}

+static void amdgpu_ras_get_ecc_info(struct amdgpu_device *adev,
+struct ras_err_data *err_data) {
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   /*
+* choose the right query method according to
+* whether the SMU supports querying error information
+*/
+   if ((ras->smu_version >= SUPPORT_ECCTABLE_SMU_VERSION) &&
+   !smu_get_ecc_info(&adev->smu, (void *)&(ras->umc_ecc))) {
+


This version check should be in aldebaran_ppt implementation. In general
the callback will check the FW version that supports ECC table for the
corresponding ASIC. It may return ENOTSUPP or similar if the FW version
doesn't support ECC table and that may be checked here. Keeping
smu_version in ras context is not needed.

[Yang, Stanley] I think just checking the aldebaran_ppt callback function is not
enough here; consider the scenario of running an amdgpu driver that has the
get_ecc_info callback with an older PMFW that lacks the ecctable feature. PMFW
supports ecctable since 68.42.0 for Aldebaran.


What I meant is the FW version check code should be part of the
aldebaran_get_ecc_info() function, and it shouldn't be here. That
function checks if the FW supports ecc_info for aldebaran; if not, it
returns ENOTSUPP.


Similarly for a newer ASIC, the corresponding ppt_func checks 
compatibility. That is one of the purposes of ASIC specific callbacks.


Thanks,
Lijo





+   if (adev->umc.ras_funcs &&
+   adev->umc.ras_funcs->message_smu_query_ras_error_count)
+   adev->umc.ras_funcs->message_smu_query_ras_error_count(adev,
+   err_data);
+
+   if (adev->umc.ras_funcs &&
+   adev->umc.ras_funcs->message_smu_query_ras_error_address)
+   adev->umc.ras_funcs->message_smu_query_ras_error_address(adev, err_data);
+   } else {
+   if (adev->umc.ras_funcs &&
+   adev->umc.ras_funcs->query_ras_error_count)
+   adev->umc.ras_funcs->query_ras_error_count(adev,

err_data);

+
+   /* umc query_ras_error_address is also responsible for

clearing

+* error status
+*/
+   if (adev->umc.ras_funcs &&
+   adev->umc.ras_funcs->query_ras_error_address)
+   adev->umc.ras_funcs->query_ras_error_address(adev, err_data);
+   }
+}
+
   /* query/inject/cure begin */
   int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
  struct ras_query_if *info)
@@ -901,15 +933,7 @@ int amdgpu_ras_query_error_status(struct amdgpu_device *adev,

	switch (info->head.block) {
	case AMDGPU_RAS_BLOCK__UMC:
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_error_count)
-   adev->umc.ras_funcs->query_ras_error_count(adev, &err_data);
-   /* umc query_ras_error_address is also responsible for clearing
-* error status
-*/
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_error_address)
-   adev->umc.ras_funcs->query_ras_error_address(adev, &err_data);
+   amdgpu_ras_get_ecc_info(adev, &err_data);
break;
case AMDGPU_RAS_BLOCK__SDMA:
if (adev->sdma.funcs->query_ras_error_count) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index bcbf3264d92f..3f0de0cc8403 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/a
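
The quoted diff is cut off above. Lijo's suggestion earlier in this message
amounts to a guard at the top of the ASIC callback. A hedged sketch follows:
the version helper is the existing smu_cmn one, and the macro value encodes
the stated 68.42.0 threshold as (major << 16) | (minor << 8) | patch; both
are assumptions, not code from this thread.

#define SUPPORT_ECCTABLE_SMU_VERSION 0x00442a00	/* 68.42.0 */

static ssize_t aldebaran_get_ecc_info(struct smu_context *smu, void *table)
{
	uint32_t smu_version;
	int ret;

	/* Gate on PMFW support inside the ASIC-specific callback so
	 * callers don't need to track smu_version themselves.
	 */
	ret = smu_cmn_get_smc_version(smu, NULL, &smu_version);
	if (ret)
		return ret;
	if (smu_version < SUPPORT_ECCTABLE_SMU_VERSION)
		return -EOPNOTSUPP;

	/* ... fetch and translate the ECC table as in the patch ... */
	return 0;
}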

Re: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Lazar, Lijo




On 11/18/2021 9:07 AM, Yang, Stanley wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Wednesday, November 17, 2021 7:24 PM
To: Yang, Stanley ; amd-
g...@lists.freedesktop.org; Zhang, Hawking ;
Clements, John ; Quan, Evan
; Wang, Yang(Kevin) 
Subject: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
ecc_table



On 11/17/2021 3:41 PM, Stanley.Yang wrote:

support the ECC TABLE message; this table includes unc ras error count and
error address

Signed-off-by: Stanley.Yang 
---
   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 
   .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38

+++

   .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 24 
   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
   5 files changed, 74 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 3557f4e7fc30..ea65de0160c3 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -324,6 +324,7 @@ enum smu_table_id
SMU_TABLE_OVERDRIVE,
SMU_TABLE_I2C_COMMANDS,
SMU_TABLE_PACE,
+   SMU_TABLE_ECCINFO,
SMU_TABLE_COUNT,
   };

@@ -340,6 +341,7 @@ struct smu_table_context
void*max_sustainable_clocks;
struct smu_bios_boot_up_values  boot_values;
void*driver_pptable;
+   void*ecc_table;
struct smu_tabletables[SMU_TABLE_COUNT];
/*
 * The driver table is just a staging buffer for
@@ -1261,6 +1263,11 @@ struct pptable_funcs {
 * of SMUBUS table.
 */
int (*send_hbm_bad_pages_num)(struct smu_context *smu, uint32_t size);
+
+   /**
+* @get_ecc_table:  message SMU to get ECC INFO table.
+*/
+   ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
   };

   typedef enum {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index f835d86cc2f5..5e4ba0e14a91 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -190,6 +190,7 @@ static const struct cmn2asic_mapping aldebaran_table_map[SMU_TABLE_COUNT] = {

TAB_MAP(SMU_METRICS),
TAB_MAP(DRIVER_SMU_CONFIG),
TAB_MAP(I2C_COMMANDS),
+   TAB_MAP(ECCINFO),
   };

static const uint8_t aldebaran_throttler_map[] = {
@@ -223,6 +224,9 @@ static int aldebaran_tables_init(struct smu_context *smu)
	SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS, sizeof(SwI2cRequest_t),
		       PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);

+	SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO, sizeof(EccInfoTable_t),
+		       PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
+
	smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t), GFP_KERNEL);
if (!smu_table->metrics_table)
return -ENOMEM;
@@ -235,6 +239,10 @@ static int aldebaran_tables_init(struct smu_context *smu)
		return -ENOMEM;
	}

+	smu_table->ecc_table = kzalloc(tables[SMU_TABLE_ECCINFO].size, GFP_KERNEL);
+   if (!smu_table->ecc_table)
+   return -ENOMEM;
+
return 0;
   }

@@ -1765,6 +1773,35 @@ static ssize_t aldebaran_get_gpu_metrics(struct smu_context *smu,
return sizeof(struct gpu_metrics_v1_3);
   }

+static ssize_t aldebaran_get_ecc_info(struct smu_context *smu,
+void *table)
+{
+   struct smu_table_context *smu_table = &smu->smu_table;
+   EccInfoTable_t ecc_table;
+   struct ecc_info_per_ch *ecc_info_per_channel = NULL;
+   int i, ret = 0;
+   struct umc_ecc_info *eccinfo = (struct umc_ecc_info *)table;
+
+   ret = smu_cmn_get_ecc_info_table(smu,
+   &ecc_table);
+   if (ret)
+   return ret;
+
+   for (i = 0; i < ALDEBARAN_UMC_CHANNEL_NUM; i++) {
+   ecc_info_per_channel = &(eccinfo->ecc[i]);
+   ecc_info_per_channel->ce_count_lo_chip =
+   ecc_table.EccInfo[i].ce_count_lo_chip;
+   ecc_info_per_channel->ce_count_hi_chip =
+   ecc_table.EccInfo[i].ce_count_hi_chip;
+   ecc_info_per_channel->mca_umc_status =
+   ecc_table.EccInfo[i].mca_umc_status;
+   ecc_info_per_channel->mca_umc_addr =
+   ecc_table.EccInfo[i].mca_umc_addr;
+   }
+
+   return ret;
+}
+
   static int aldebaran_mode1_reset(struct smu_context *smu)
   {
u32 smu_version, fatal_err, param;
@@ -1967,6 +2004,7 @@ static const struct pptable_funcs aldebaran_ppt_funcs = {

.i2c_init = aldebaran_i2c_control_init,
.i2c_fini = aldebaran_i2c_control_fini,
.send
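
The quote stops just as it reaches the function table. Based on the
get_ecc_info callback declared earlier in the patch, the registration is a
single new entry (sketch; the neighboring entries are elided here):

static const struct pptable_funcs aldebaran_ppt_funcs = {
	/* ... existing callbacks ... */
	.get_ecc_info = aldebaran_get_ecc_info,	/* added by this patch */
};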

Re: [PATCH Review 4/4] query umc error info from ecc_table

2021-11-17 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Wednesday, November 17, 2021 7:15 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: [PATCH Review 4/4] query umc error info from ecc_table
> 
> 
> 
> On 11/17/2021 3:41 PM, Stanley.Yang wrote:
> > if the SMU supports ECCTABLE, the driver can message the SMU to get the
> > ecc_table and then query UMC error info from ECCTABLE; apply a PMFW
> > version check to ensure backward compatibility
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 42 ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  7 ++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   | 71 +--
> 
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  1 +
> >   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 12 
> >   .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  4 ++
> >   6 files changed, 107 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 90f0db3b4f65..6b0f2ba1e420 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -888,6 +888,38 @@ void amdgpu_ras_mca_query_error_status(struct
> amdgpu_device *adev,
> > }
> >   }
> >
> > +static void amdgpu_ras_get_ecc_info(struct amdgpu_device *adev,
> > +struct ras_err_data *err_data) {
> > +   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> > +
> > +   /*
> > +* choosing right query method according to
> > +* whether smu support query error information
> > +*/
> > +   if ((ras->smu_version >= SUPPORT_ECCTABLE_SMU_VERSION) &&
> > +   !smu_get_ecc_info(&adev->smu, (void *)&(ras->umc_ecc))) {
> > +
> 
> This version check should be in aldebaran_ppt implementation. In general
> the callback will check the FW version that supports ECC table for the
> corresponding ASIC. It may return ENOTSUPP or similar if the FW version
> doesn't support ECC table and that may be checked here. Keeping
> smu_version in ras context is not needed.
[Yang, Stanley] I think just checking the aldebaran_ppt callback function is not
enough here; consider the scenario of running an amdgpu driver that has the
get_ecc_info callback with an older PMFW that lacks the ecctable feature. PMFW
supports ecctable since 68.42.0 for Aldebaran.

> 
> > +   if (adev->umc.ras_funcs &&
> > +   adev->umc.ras_funcs->message_smu_query_ras_error_count)
> > +   adev->umc.ras_funcs->message_smu_query_ras_error_count(adev,
> > +err_data);
> > +
> > +   if (adev->umc.ras_funcs &&
> > +   adev->umc.ras_funcs->message_smu_query_ras_error_address)
> > +   adev->umc.ras_funcs->message_smu_query_ras_error_address(adev, err_data);
> > +   } else {
> > +   if (adev->umc.ras_funcs &&
> > +   adev->umc.ras_funcs->query_ras_error_count)
> > +   adev->umc.ras_funcs->query_ras_error_count(adev,
> err_data);
> > +
> > +   /* umc query_ras_error_address is also responsible for
> clearing
> > +* error status
> > +*/
> > +   if (adev->umc.ras_funcs &&
> > +   adev->umc.ras_funcs->query_ras_error_address)
> > +   adev->umc.ras_funcs->query_ras_error_address(adev, err_data);
> > +   }
> > +}
> > +
> >   /* query/inject/cure begin */
> >   int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
> >   struct ras_query_if *info)
> > @@ -901,15 +933,7 @@ int amdgpu_ras_query_error_status(struct
> > amdgpu_device *adev,
> >
> > switch (info->head.block) {
> > case AMDGPU_RAS_BLOCK__UMC:
> > -   if (adev->umc.ras_funcs &&
> > -   adev->umc.ras_funcs->query_ras_error_count)
> > -   adev->umc.ras_funcs->query_ras_error_count(adev,
> &err_data);
> > -   /* umc query_ras_error_address is also responsible for
> clearing
> > -* error status
> > -*/
> > -   if (adev->umc.ras_funcs &&
> > -   adev->umc.ras_funcs->query_ras_error_address)
> > -   adev->umc.ras_funcs->query_ras_error_address(adev, &err_data);
> > +   amdgpu_ras_get_ecc_info(adev, &err_data);
> > break;
> > case AMDGPU_RAS_BLOCK__SDMA:
> > if (adev->sdma.funcs->query_ras_error_count) {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index bcbf3264d92f..3f0de0cc8403 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -322,6 +322,12 @@ struct ras_common_if {
> >
> >   #define MAX_UMC_CHANNEL_NUM 32
> >
> > +/*
> > + * SMU support ECCTABLE since version 68.42.0,
> > + * use this to decide query umc error info 
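
The macro definition is cut off above. amdgpu packs SMU firmware versions as
(major << 16) | (minor << 8) | patch, so 68.42.0 encodes as
(68 << 16) | (42 << 8) | 0 = 0x00442a00; a plausible completion of the
truncated hunk (an assumption, not archive text):

#define SUPPORT_ECCTABLE_SMU_VERSION 0x00442a00	/* 68.42.0 */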

[PATCH 1/2] drm/amdgpu: Generalize KFD dmabuf import

2021-11-17 Thread Felix Kuehling
Use proper amdgpu_gem_prime_import function to handle all kinds of
imports. Remember the dmabuf reference to enable proper multi-GPU
attachment to multiple VMs without erroneously re-exporting the
underlying BO multiple times.

Signed-off-by: Felix Kuehling 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 90b985436878..d53d19b9d6dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2000,30 +2000,34 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct 
amdgpu_device *adev,
struct amdgpu_bo *bo;
int ret;
 
-   if (dma_buf->ops != &amdgpu_dmabuf_ops)
-   /* Can't handle non-graphics buffers */
-   return -EINVAL;
-
-   obj = dma_buf->priv;
-   if (drm_to_adev(obj->dev) != adev)
-   /* Can't handle buffers from other devices */
-   return -EINVAL;
+   obj = amdgpu_gem_prime_import(adev_to_drm(adev), dma_buf);
+   if (IS_ERR(obj))
+   return PTR_ERR(obj);
+   if (obj->import_attach) {
+   /* Import takes an extra reference if it creates a dynamic
+* dmabuf attachment. Release that reference to avoid
+* leaking it. FIXME: I think this is a bug.
+*/
+   dma_buf_put(dma_buf);
+   }
 
bo = gem_to_amdgpu_bo(obj);
if (!(bo->preferred_domains & (AMDGPU_GEM_DOMAIN_VRAM |
-   AMDGPU_GEM_DOMAIN_GTT)))
+   AMDGPU_GEM_DOMAIN_GTT))) {
/* Only VRAM and GTT BOs are supported */
-   return -EINVAL;
+   ret = -EINVAL;
+   goto err_put_obj;
+   }
 
*mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL);
-   if (!*mem)
-   return -ENOMEM;
+   if (!*mem) {
+   ret = -ENOMEM;
+   goto err_put_obj;
+   }
 
ret = drm_vma_node_allow(&obj->vma_node, drm_priv);
-   if (ret) {
-   kfree(mem);
-   return ret;
-   }
+   if (ret)
+   goto err_free_mem;
 
if (size)
*size = amdgpu_bo_size(bo);
@@ -2040,7 +2044,8 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct 
amdgpu_device *adev,
| KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE
| KFD_IOC_ALLOC_MEM_FLAGS_EXECUTABLE;
 
-   drm_gem_object_get(&bo->tbo.base);
+   get_dma_buf(dma_buf);
+   (*mem)->dmabuf = dma_buf;
(*mem)->bo = bo;
(*mem)->va = va;
(*mem)->domain = (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) ?
@@ -2052,6 +2057,12 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct 
amdgpu_device *adev,
(*mem)->is_imported = true;
 
return 0;
+
+err_free_mem:
+   kfree(mem);
+err_put_obj:
+   drm_gem_object_put(obj);
+   return ret;
 }
 
 /* Evict a userptr BO by stopping the queues if necessary
-- 
2.32.0
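
A compact restatement of the reference-counting invariant the import path
above establishes (a summary sketch, not text from the patch):

/*
 * After a successful amdgpu_amdkfd_gpuvm_import_dmabuf():
 *   - obj holds one GEM reference taken by amdgpu_gem_prime_import(),
 *     now owned by (*mem)->bo;
 *   - (*mem)->dmabuf holds one dma_buf reference taken explicitly with
 *     get_dma_buf(), to be dropped when the kgd_mem is freed;
 *   - the extra dma_buf reference a dynamic attachment takes is released
 *     immediately (see the FIXME in the patch).
 * On any error, drm_gem_object_put(obj) undoes the import.
 */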



[PATCH 2/2] drm/amdkfd: Implement DMA buf fd export for RDMA

2021-11-17 Thread Felix Kuehling
Exports a DMA buf fd of a given KFD buffer handle. This is intended for
the new upstreamable RDMA solution coming to UCX and libfabric.

The corresponding user mode change (Thunk API and kfdtest) is here:
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/commits/fxkamd/dmabuf

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +++
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 55 +++
 include/uapi/linux/kfd_ioctl.h| 14 -
 4 files changed, 104 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index fcbc8a9c9e06..840de82460a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -294,6 +294,8 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device 
*adev,
  uint64_t va, void *drm_priv,
  struct kgd_mem **mem, uint64_t *size,
  uint64_t *mmap_offset);
+int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem,
+ struct dma_buf **dmabuf);
 int amdgpu_amdkfd_get_tile_config(struct amdgpu_device *adev,
struct tile_config *config);
 void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index d53d19b9d6dc..9f57e5091fa8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -641,6 +641,21 @@ kfd_mem_attach_userptr(struct amdgpu_device *adev, struct 
kgd_mem *mem,
return 0;
 }
 
+static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
+{
+   if (!mem->dmabuf) {
+   struct dma_buf *ret = amdgpu_gem_prime_export(
+   &mem->bo->tbo.base,
+   mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
+   DRM_RDWR : 0);
+   if (IS_ERR(ret))
+   return PTR_ERR(ret);
+   mem->dmabuf = ret;
+   }
+
+   return 0;
+}
+
 static int
 kfd_mem_attach_dmabuf(struct amdgpu_device *adev, struct kgd_mem *mem,
  struct amdgpu_bo **bo)
@@ -648,16 +663,9 @@ kfd_mem_attach_dmabuf(struct amdgpu_device *adev, struct 
kgd_mem *mem,
struct drm_gem_object *gobj;
int ret;
 
-   if (!mem->dmabuf) {
-   mem->dmabuf = amdgpu_gem_prime_export(&mem->bo->tbo.base,
-   mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-   DRM_RDWR : 0);
-   if (IS_ERR(mem->dmabuf)) {
-   ret = PTR_ERR(mem->dmabuf);
-   mem->dmabuf = NULL;
-   return ret;
-   }
-   }
+   ret = kfd_mem_export_dmabuf(mem);
+   if (ret)
+   return ret;
 
gobj = amdgpu_gem_prime_import(adev_to_drm(adev), mem->dmabuf);
if (IS_ERR(gobj))
@@ -2065,6 +2073,23 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct 
amdgpu_device *adev,
return ret;
 }
 
+int amdgpu_amdkfd_gpuvm_export_dmabuf(struct kgd_mem *mem,
+ struct dma_buf **dma_buf)
+{
+   int ret;
+
+   mutex_lock(&mem->lock);
+   ret = kfd_mem_export_dmabuf(mem);
+   if (ret)
+   goto out;
+
+   get_dma_buf(mem->dmabuf);
+   *dma_buf = mem->dmabuf;
+out:
+   mutex_unlock(&mem->lock);
+   return ret;
+}
+
 /* Evict a userptr BO by stopping the queues if necessary
  *
  * Runs in MMU notifier, may be in RECLAIM_FS context. This means it
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 4bfc0c8ab764..ddbc28951ac1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1787,6 +1787,58 @@ static int kfd_ioctl_import_dmabuf(struct file *filep,
return r;
 }
 
+static int kfd_ioctl_export_dmabuf(struct file *filep,
+  struct kfd_process *p, void *data)
+{
+   struct kfd_ioctl_export_dmabuf_args *args = data;
+   struct kfd_process_device *pdd;
+   struct dma_buf *dmabuf;
+   struct kfd_dev *dev;
+   void *mem;
+   int ret = 0;
+
+   dev = kfd_device_by_id(GET_GPU_ID(args->handle));
+   if (!dev)
+   return -EINVAL;
+
+   mutex_lock(&p->mutex);
+
+   pdd = kfd_get_process_device_data(dev, p);
+   if (!pdd) {
+   ret = -EINVAL;
+   goto err_unlock;
+   }
+
+   mem = kfd_process_device_translate_handle(pdd,
+   GET_IDR_HANDLE(args->handle));
+   if (!mem) {
+   ret
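
To illustrate how the new ioctl is meant to be consumed, a hypothetical
userspace sketch follows. The struct layout mirrors the kfd_ioctl.h
additions named in the diffstat, but the exact definitions are truncated in
this archive, so treat the struct fields and especially the ioctl number as
assumptions:

#include <stdint.h>
#include <sys/ioctl.h>

/* Assumed layout of the new uapi struct */
struct kfd_ioctl_export_dmabuf_args {
	uint64_t handle;     /* to KFD: buffer handle (GPU id + IDR handle) */
	uint32_t flags;      /* to KFD: currently unused */
	uint32_t dmabuf_fd;  /* from KFD: the exported DMA buf fd */
};
#define AMDKFD_IOC_EXPORT_DMABUF \
	_IOWR('K', 0x24, struct kfd_ioctl_export_dmabuf_args)  /* nr is a guess */

static int kfd_export_dmabuf(int kfd_fd, uint64_t handle)
{
	struct kfd_ioctl_export_dmabuf_args args = { .handle = handle };

	if (ioctl(kfd_fd, AMDKFD_IOC_EXPORT_DMABUF, &args) < 0)
		return -1;
	return args.dmabuf_fd;  /* hand this fd to the RDMA stack */
}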

Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Wednesday, November 17, 2021 7:24 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
> ecc_table
> 
> 
> 
> On 11/17/2021 3:41 PM, Stanley.Yang wrote:
> > support the ECC TABLE message; this table includes unc ras error count and
> > error address
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 
> >   .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38
> +++
> >   .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
> >   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 24 
> >   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
> >   5 files changed, 74 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > index 3557f4e7fc30..ea65de0160c3 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > @@ -324,6 +324,7 @@ enum smu_table_id
> > SMU_TABLE_OVERDRIVE,
> > SMU_TABLE_I2C_COMMANDS,
> > SMU_TABLE_PACE,
> > +   SMU_TABLE_ECCINFO,
> > SMU_TABLE_COUNT,
> >   };
> >
> > @@ -340,6 +341,7 @@ struct smu_table_context
> > void*max_sustainable_clocks;
> > struct smu_bios_boot_up_values  boot_values;
> > void*driver_pptable;
> > +   void*ecc_table;
> > struct smu_tabletables[SMU_TABLE_COUNT];
> > /*
> >  * The driver table is just a staging buffer for
> > @@ -1261,6 +1263,11 @@ struct pptable_funcs {
> >  * of SMUBUS table.
> >  */
> > int (*send_hbm_bad_pages_num)(struct smu_context *smu, uint32_t size);
> > +
> > +   /**
> > +* @get_ecc_table:  message SMU to get ECC INFO table.
> > +*/
> > +   ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
> >   };
> >
> >   typedef enum {
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > index f835d86cc2f5..5e4ba0e14a91 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > @@ -190,6 +190,7 @@ static const struct cmn2asic_mapping
> aldebaran_table_map[SMU_TABLE_COUNT] = {
> > TAB_MAP(SMU_METRICS),
> > TAB_MAP(DRIVER_SMU_CONFIG),
> > TAB_MAP(I2C_COMMANDS),
> > +   TAB_MAP(ECCINFO),
> >   };
> >
> >   static const uint8_t aldebaran_throttler_map[] = { @@ -223,6 +224,9
> > @@ static int aldebaran_tables_init(struct smu_context *smu)
> > SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS,
> sizeof(SwI2cRequest_t),
> >PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> >
> > +   SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO,
> sizeof(EccInfoTable_t),
> > +  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > +
> > smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t),
> GFP_KERNEL);
> > if (!smu_table->metrics_table)
> > return -ENOMEM;
> > @@ -235,6 +239,10 @@ static int aldebaran_tables_init(struct smu_context
> *smu)
> > return -ENOMEM;
> > }
> >
> > +   smu_table->ecc_table = kzalloc(tables[SMU_TABLE_ECCINFO].size,
> GFP_KERNEL);
> > +   if (!smu_table->ecc_table)
> > +   return -ENOMEM;
> > +
> > return 0;
> >   }
> >
> > @@ -1765,6 +1773,35 @@ static ssize_t aldebaran_get_gpu_metrics(struct
> smu_context *smu,
> > return sizeof(struct gpu_metrics_v1_3);
> >   }
> >
> > +static ssize_t aldebaran_get_ecc_info(struct smu_context *smu,
> > +void *table)
> > +{
> > +   struct smu_table_context *smu_table = &smu->smu_table;
> > +   EccInfoTable_t ecc_table;
> > +   struct ecc_info_per_ch *ecc_info_per_channel = NULL;
> > +   int i, ret = 0;
> > +   struct umc_ecc_info *eccinfo = (struct umc_ecc_info *)table;
> > +
> > +   ret = smu_cmn_get_ecc_info_table(smu,
> > +   &ecc_table);
> > +   if (ret)
> > +   return ret;
> > +
> > +   for (i = 0; i < ALDEBARAN_UMC_CHANNEL_NUM; i++) {
> > +   ecc_info_per_channel = &(eccinfo->ecc[i]);
> > +   ecc_info_per_channel->ce_count_lo_chip =
> > +   ecc_table.EccInfo[i].ce_count_lo_chip;
> > +   ecc_info_per_channel->ce_count_hi_chip =
> > +   ecc_table.EccInfo[i].ce_count_hi_chip;
> > +   ecc_info_per_channel->mca_umc_status =
> > +   ecc_table.EccInfo[i].mca_umc_status;
> > +   ecc_info_per_channel->mca_umc_addr =
> > +   ecc_table.EccInfo[i].mca_umc_addr;
> > +   }
> > +
> > +   return ret;
> > +}
> > +
> >   static int aldebaran_mode1_reset(struct smu_context *smu)
> >   {
> > u32 smu_version, fatal_err, param;
> > @@ -1967,6 +20

Re: [PATCH Review 2/4] drm/amdgpu: add new query interface for umc block

2021-11-17 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Wednesday, November 17, 2021 7:36 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: [PATCH Review 2/4] drm/amdgpu: add new query interface for
> umc block
> 
> 
> 
> On 11/17/2021 3:41 PM, Stanley.Yang wrote:
> > add message smu to query error information
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  16 +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |   4 +
> >   drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   | 161
> 
> >   3 files changed, 181 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index cdd0010a5389..bcbf3264d92f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -320,6 +320,19 @@ struct ras_common_if {
> > char name[32];
> >   };
> >
> > +#define MAX_UMC_CHANNEL_NUM 32
> > +
> > +struct ecc_info_per_ch {
> > +   uint16_t ce_count_lo_chip;
> > +   uint16_t ce_count_hi_chip;
> > +   uint64_t mca_umc_status;
> > +   uint64_t mca_umc_addr;
> > +};
> > +
> > +struct umc_ecc_info {
> > +   struct ecc_info_per_ch ecc[MAX_UMC_CHANNEL_NUM]; };
> > +
> >   struct amdgpu_ras {
> > /* ras infrastructure */
> > /* for ras itself. */
> > @@ -359,6 +372,9 @@ struct amdgpu_ras {
> > struct delayed_work ras_counte_delay_work;
> > atomic_t ras_ue_count;
> > atomic_t ras_ce_count;
> > +
> > +   /* record umc error info queried from smu */
> > +   struct umc_ecc_info umc_ecc;
> >   };
> >
> >   struct ras_fs_data {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > index 1f5fe2315236..7aa9b21eb906 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > @@ -49,6 +49,10 @@ struct amdgpu_umc_ras_funcs {
> > void (*query_ras_error_address)(struct amdgpu_device *adev,
> > void *ras_error_status);
> > bool (*query_ras_poison_mode)(struct amdgpu_device *adev);
> > +   void (*message_smu_query_ras_error_count)(struct
> amdgpu_device *adev,
> > + void *ras_error_status);
> > +   void (*message_smu_query_ras_error_address)(struct
> amdgpu_device *adev,
> > +   void *ras_error_status);
> 
> Maybe rename message_smu to ecc_info. These methods fetch the error
> from umc_ecc_info table. They don't deal with smu or care about how the
> information gets filled. As long as ecc_info_table is filled, they could get 
> the
> info.
[Yang, Stanley] yeah, it seems renaming message_smu to ecc_info is better since
ecc_table has been updated before this call.

> 
> Thanks,
> Lijo
> 
> >   };
> >
> >   struct amdgpu_umc_funcs {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > index f7ec3fe134e5..cd96e8b734cb 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > @@ -50,6 +50,165 @@ static inline uint32_t
> get_umc_v6_7_reg_offset(struct amdgpu_device *adev,
> > return adev->umc.channel_offs * ch_inst + UMC_V6_7_INST_DIST *
> umc_inst;
> >   }
> >
> > +static inline uint32_t get_umc_v6_7_channel_index(struct
> amdgpu_device *adev,
> > + uint32_t umc_inst,
> > + uint32_t ch_inst)
> > +{
> > +   return adev->umc.channel_idx_tbl[umc_inst *
> > +adev->umc.channel_inst_num + ch_inst]; }
> > +
> > +static void
> umc_v6_7_message_smu_query_correctable_error_count(struct
> amdgpu_device *adev,
> > +  uint32_t channel_index,
> > +  unsigned long *error_count)
> > +{
> > +   uint32_t ecc_err_cnt;
> > +   uint64_t mc_umc_status;
> > +   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> > +
> > +   /*
> > +* select the lower chip and check the error count
> > +* skip add error count, calc error counter only from mca_umc_status
> > +*/
> > +   ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_lo_chip;
> > +
> > +   /*
> > +* select the higher chip and check the err counter
> > +* skip add error count, calc error counter only from mca_umc_status
> > +*/
> > +   ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_hi_chip;
> > +
> > +   /* check for SRAM correctable error
> > + MCUMC_STATUS is a 64 bit register */
> > +   mc_umc_status = ras-
> >umc_ecc.ecc[channel_index].mca_umc_status;
> > +   if (REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 &&
> > +   REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 1)
> > +   *error_count += 1;
> > +}
> > +
> > +static void
> umc_v6_7_message_smu_qu

Re: [PATCH v2 1/1] drm/amdgpu: Fix MMIO HDP flush on SRIOV

2021-11-17 Thread Felix Kuehling

On 2021-11-10 11:34 a.m., Felix Kuehling wrote:

On 2021-11-10 at 11:11 a.m., Lazar, Lijo wrote:

[Public]



(... && !amdgpu_sriov_vf(adev))

This kind of closes the door for all versions. My thought was - having
it in the same function provides a logical grouping for how it's
handled for different cases - VF vs non-VF - for a particular IP version.

Except that's not really true. There is still common code (setting
adev->rmmio_remap.bus_addr) for handling MMIO remapping on VFs in
soc15.c and nv.c. I'm not comfortable with duplicating all of that
across all the IP version-specific files.

I also think it's very unlikely that a small IP version bump is going to
change this logic. So I'd rather prefer to keep the code more general
and conservatively correct.


Hi Lijo,

The virtualization team has finished testing this patch and wants me to 
submit it. Do you insist I rework the patch to move all the 
adev->rmmio_remap fixups for virtualization into the nbio 
version-specific remap_hdp_registers functions?


Regards,
  Felix




Regards,
   Felix



Thanks,
Lijo

*From:* Kuehling, Felix 
*Sent:* Wednesday, November 10, 2021 9:27:22 PM
*To:* Lazar, Lijo ; amd-gfx@lists.freedesktop.org

*Cc:* Zhang, Bokun 
*Subject:* Re: [PATCH v2 1/1] drm/amdgpu: Fix MMIO HDP flush on SRIOV
  
On 2021-11-10 at 4:14 a.m., Lazar, Lijo wrote:

On 11/10/2021 8:00 AM, Felix Kuehling wrote:

Disable HDP register remapping on SRIOV and set rmmio_remap.reg_offset
to the fixed address of the VF register for hdp_v*_flush_hdp.

Signed-off-by: Felix Kuehling 
---
   drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c | 4 
   drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c | 4 
   drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c | 4 +++-
   drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c | 4 
   drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 4 +++-
   drivers/gpu/drm/amd/amdgpu/nv.c    | 8 +---
   drivers/gpu/drm/amd/amdgpu/soc15.c | 8 +---
   7 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c
b/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c
index 4ecd2b5808ce..ee7cab37dfd5 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c
@@ -359,6 +359,10 @@ static void nbio_v2_3_init_registers(struct
amdgpu_device *adev)
     if (def != data)
   WREG32_PCIE(smnPCIE_CONFIG_CNTL, data);
+
+    if (amdgpu_sriov_vf(adev))
+    adev->rmmio_remap.reg_offset = SOC15_REG_OFFSET(NBIO, 0,
+    mmBIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL) << 2;

Wouldn't it be better to do this assignment inside
remap_hdp_registers()? Return with a comment saying no remap is done
for VFs. That looks easier to manage per IP version.

I was considering that. I felt it was clearer not to have that hidden
side effect in remap_hdp_registers and to have the explicit condition
(... &&  !amdgpu_sriov_vf(adev)) around the remap_hdp_registers call in
soc15/nv_common_hw_init.

Regards,
   Felix



Thanks,
Lijo


   }
     #define NAVI10_PCIE__LC_L0S_INACTIVITY_DEFAULT    0x
// off by default, no gains over L1
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c
b/drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c
index 0d2d629e2d6a..4bbacf1be25a 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v6_1.c
@@ -276,6 +276,10 @@ static void nbio_v6_1_init_registers(struct
amdgpu_device *adev)
     if (def != data)
   WREG32_PCIE(smnPCIE_CI_CNTL, data);
+
+    if (amdgpu_sriov_vf(adev))
+    adev->rmmio_remap.reg_offset = SOC15_REG_OFFSET(NBIO, 0,
+    mmBIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL) << 2;
   }
     static void nbio_v6_1_program_ltr(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c
b/drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c
index 3c00666a13e1..37a4039fdfc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_0.c
@@ -273,7 +273,9 @@ const struct nbio_hdp_flush_reg
nbio_v7_0_hdp_flush_reg = {
     static void nbio_v7_0_init_registers(struct amdgpu_device *adev)
   {
-
+    if (amdgpu_sriov_vf(adev))
+    adev->rmmio_remap.reg_offset =
+    SOC15_REG_OFFSET(NBIO, 0,
mmHDP_MEM_COHERENCY_FLUSH_CNTL) << 2;
   }
     const struct amdgpu_nbio_funcs nbio_v7_0_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c
b/drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c
index 8f2a315e7c73..3444332ea110 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_2.c
@@ -371,6 +371,10 @@ static void nbio_v7_2_init_registers(struct
amdgpu_device *adev)
   if (def != data)
   WREG32_PCIE_PORT(SOC15_REG_OFFSET(NBIO, 0,
regPCIE_CONFIG_CNTL), data);
   }
+
+    if (amdgpu_sriov_vf(adev))
+    adev->rmmio_remap.reg_offset = SOC15_REG_OFFSET(NBIO, 0,
+    regBIF_BX_PF0_HDP_MEM_COHERENCY_FLUSH_CNTL) << 2;
   }
     const 

Re: [PATCH 3/3] drm/amdkfd: simplify drain retry fault

2021-11-17 Thread Felix Kuehling



On 2021-11-16 10:43 p.m., Philip Yang wrote:

unmap range always sets the svms->drain_pagefaults flag to simplify both
parent range and child range unmap. Deferred list work takes the mmap write
lock to read and clear svms->drain_pagefaults, to serialize with the unmap
callback.

Add an atomic flag, svms->draining_faults; if it is set, the page fault
handler ignores the retry fault, to speed up interrupt handling.
Having both svms->drain_pagefaults and svms->draining_faults is 
confusing. Do we really need both?
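
If a single piece of state is enough, one way to unify them, as a rough
sketch (not from the patch): make drain_pagefaults itself the atomic that
the unmap path raises, the worker lowers, and the fault handler checks:

	/* struct svm_range_list: replace the bool/atomic pair */
	atomic_t	drain_pagefaults; /* non-zero while a drain is pending or running */

	/* unmap path: request a drain */
	atomic_inc(&svms->drain_pagefaults);

	/* fault handler: drop stale retry faults while draining */
	if (atomic_read(&svms->drain_pagefaults))
		return 0;

	/* deferred worker: after svm_range_drain_retry_fault() returns */
	atomic_dec(&svms->drain_pagefaults);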


Regards,
  Felix



Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  1 +
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 24 ++--
  2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 1d3f012bcd2a..4e4640b731e1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -767,6 +767,7 @@ struct svm_range_list {
spinlock_t  deferred_list_lock;
atomic_tevicted_ranges;
booldrain_pagefaults;
+   atomic_tdraining_faults;
struct delayed_work restore_work;
DECLARE_BITMAP(bitmap_supported, MAX_GPU_INSTANCE);
struct task_struct  *faulting_task;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3eb0a9491755..d332462bf9d3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1962,6 +1962,7 @@ void svm_range_drain_retry_fault(struct svm_range_list 
*svms)
  
  	p = container_of(svms, struct kfd_process, svms);
  
+	atomic_set(&svms->draining_faults, 1);

for_each_set_bit(i, svms->bitmap_supported, p->n_pdds) {
pdd = p->pdds[i];
if (!pdd)
@@ -1975,6 +1976,7 @@ void svm_range_drain_retry_fault(struct svm_range_list 
*svms)
flush_work(&adev->irq.ih1_work);
pr_debug("drain retry fault gpu %d svms 0x%p done\n", i, svms);
}
+   atomic_set(&svms->draining_faults, 0);
  }
  
  static void svm_range_deferred_list_work(struct work_struct *work)

@@ -2002,6 +2004,7 @@ static void svm_range_deferred_list_work(struct 
work_struct *work)
 * mmap write lock to serialize with munmap notifiers.
 */
if (unlikely(READ_ONCE(svms->drain_pagefaults))) {
+   atomic_set(&svms->draining_faults, 1);
WRITE_ONCE(svms->drain_pagefaults, false);
mmap_write_unlock(mm);
svm_range_drain_retry_fault(svms);
@@ -2049,12 +2052,6 @@ svm_range_add_list_work(struct svm_range_list *svms, 
struct svm_range *prange,
struct mm_struct *mm, enum svm_work_list_ops op)
  {
spin_lock(&svms->deferred_list_lock);
-   /* Make sure pending page faults are drained in the deferred worker
-* before the range is freed to avoid straggler interrupts on
-* unmapped memory causing "phantom faults".
-*/
-   if (op == SVM_OP_UNMAP_RANGE)
-   svms->drain_pagefaults = true;
/* if prange is on the deferred list */
if (!list_empty(&prange->deferred_list)) {
pr_debug("update exist prange 0x%p work op %d\n", prange, op);
@@ -2133,6 +2130,13 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct 
svm_range *prange,
pr_debug("svms 0x%p prange 0x%p [0x%lx 0x%lx] [0x%lx 0x%lx]\n", svms,
 prange, prange->start, prange->last, start, last);
  
+	/* Make sure pending page faults are drained in the deferred worker

+* before the range is freed to avoid straggler interrupts on
+* unmapped memory causing "phantom faults".
+*/
+   pr_debug("set range drain_pagefaults true\n");
+   WRITE_ONCE(svms->drain_pagefaults, true);
+
unmap_parent = start <= prange->start && last >= prange->last;
  
  	list_for_each_entry(pchild, &prange->child_list, child_list) {

@@ -2595,6 +2599,13 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
mm = p->mm;
mmap_write_lock(mm);
  
+	if (!!atomic_read(&svms->draining_faults) ||

+   READ_ONCE(svms->drain_pagefaults)) {
+   pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
+   mmap_write_downgrade(mm);
+   goto out_unlock_mm;
+   }
+
vma = find_vma(p->mm, addr << PAGE_SHIFT);
if (!vma || (addr << PAGE_SHIFT) < vma->vm_start) {
pr_debug("VMA not found for address 0x%llx\n", addr);
@@ -2732,6 +2743,7 @@ int svm_range_list_init(struct kfd_process *p)
mutex_init(&svms->lock);
INIT_LIST_HEAD(&svms->list);
atomic_set(&svms->evicted_ranges, 0);
+   atomic_set(&svms->draining_faults, 0);
INIT_DELAYED_WORK(&svms->restore_work, svm_range_restore_

Re: [PATCH 2/3] drm/amdkfd: handle VMA remove race

2021-11-17 Thread Felix Kuehling

On 2021-11-16 10:43 p.m., Philip Yang wrote:

The VMA may be removed before the unmap notifier callback; restore pages
takes the mmap write lock to look up the VMA to avoid the race,


The old code looked up the VMA after taking the mmap lock (either read 
or write) and kept holding the lock afterwards. I think even with your 
new code it's possible that the VMA disappears before you take the lock 
the first time, so always taking the write lock only reduces the time 
window in which things can go wrong, but it doesn't remove the race.


I still struggle to understand the race you're trying to fix. The only 
time the svm_restore_pages can see that the VMA is gone AND the prange 
is gone is after the deferred worker has removed the prange. But the 
fault draining in the deferred worker should prevent us from ever seeing 
stale faults in that situation. That means, if no prange is found and no 
VMA is found, it's definitely an application bug.


The only possible race is in the case where the prange still exists but 
the VMA is gone (needed by svm_fault_allowed). We can treat that as a 
special case where we just return success because we know that we're 
handling a stale fault for a VMA that's in the middle of being unmapped. 
The fact that the prange still existed means that there once was a valid 
mapping at the address but the deferred worker just hasn't had a chance 
to clean it up yet.
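
In code, that special case could look roughly like this in
svm_range_restore_pages, after the prange lookup has succeeded (a sketch;
the out_unlock_mm label is assumed from patch 3/3):

	vma = find_vma(mm, addr << PAGE_SHIFT);
	if (!vma || (addr << PAGE_SHIFT) < vma->vm_start) {
		/* prange exists but the VMA is being unmapped: this is a
		 * stale fault the deferred worker hasn't cleaned up yet,
		 * so report success instead of a VM fault.
		 */
		pr_debug("VMA gone for 0x%llx, dropping stale fault\n", addr);
		r = 0;
		goto out_unlock_mm;
	}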


One more comment inline.



  and then creates an unregistered
new range and checks VMA access permission, then downgrades to the mmap
read lock to recover the fault. Refactor the code to avoid a duplicate VMA
lookup.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 65 ++--
  1 file changed, 24 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index c1f367934428..3eb0a9491755 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2329,20 +2329,13 @@ svm_range_best_restore_location(struct svm_range 
*prange,
  }
  
  static int

-svm_range_get_range_boundaries(struct kfd_process *p, int64_t addr,
-  unsigned long *start, unsigned long *last,
-  bool *is_heap_stack)
+svm_range_get_range_boundaries(struct kfd_process *p, struct vm_area_struct 
*vma,
+  int64_t addr, unsigned long *start,
+  unsigned long *last, bool *is_heap_stack)
  {
-   struct vm_area_struct *vma;
struct interval_tree_node *node;
unsigned long start_limit, end_limit;
  
-	vma = find_vma(p->mm, addr << PAGE_SHIFT);

-   if (!vma || (addr << PAGE_SHIFT) < vma->vm_start) {
-   pr_debug("VMA does not exist in address [0x%llx]\n", addr);
-   return -EFAULT;
-   }
-
*is_heap_stack = (vma->vm_start <= vma->vm_mm->brk &&
  vma->vm_end >= vma->vm_mm->start_brk) ||
 (vma->vm_start <= vma->vm_mm->start_stack &&
@@ -2437,9 +2430,10 @@ svm_range_check_vm_userptr(struct kfd_process *p, 
uint64_t start, uint64_t last,
  
  static struct

  svm_range *svm_range_create_unregistered_range(struct amdgpu_device *adev,
-   struct kfd_process *p,
-   struct mm_struct *mm,
-   int64_t addr)
+  struct kfd_process *p,
+  struct mm_struct *mm,
+  struct vm_area_struct *vma,
+  int64_t addr)
  {
struct svm_range *prange = NULL;
unsigned long start, last;
@@ -2449,7 +2443,7 @@ svm_range *svm_range_create_unregistered_range(struct 
amdgpu_device *adev,
uint64_t bo_l = 0;
int r;
  
-	if (svm_range_get_range_boundaries(p, addr, &start, &last,

+   if (svm_range_get_range_boundaries(p, vma, addr, &start, &last,
   &is_heap_stack))
return NULL;
  
@@ -2552,20 +2546,13 @@ svm_range_count_fault(struct amdgpu_device *adev, struct kfd_process *p,

  }
  
  static bool

-svm_fault_allowed(struct mm_struct *mm, uint64_t addr, bool write_fault)
+svm_fault_allowed(struct vm_area_struct *vma, bool write_fault)
  {
unsigned long requested = VM_READ;
-   struct vm_area_struct *vma;
  
  	if (write_fault)

requested |= VM_WRITE;
  
-	vma = find_vma(mm, addr << PAGE_SHIFT);

-   if (!vma || (addr << PAGE_SHIFT) < vma->vm_start) {
-   pr_debug("address 0x%llx VMA is removed\n", addr);
-   return true;
-   }
-
pr_debug("requested 0x%lx, vma permission flags 0x%lx\n", requested,
vma->vm_flags);
return (vma->vm_flags & requested) == requested;
@@ -2582,7 +2569,7 @@ svm_range_resto

Re: [PATCH 1/3] drm/amdkfd: process exit and retry fault race

2021-11-17 Thread Felix Kuehling

On 2021-11-16 10:43 p.m., Philip Yang wrote:

The kfd process mmu release notifier callback drains retry faults to ensure
no retry fault comes in after the kfd process is removed from the hash table;
otherwise the svm page fault handler will fail to recover the fault and dump
a GPU vm fault log.

Draining retry faults needs to flush the restore page fault work to wait
until the last fault is handled, because IH dispatch increases rptr first and
then calls restore_pages, so restore_pages may still be handling the last
fault while amdgpu_ih_has_checkpoint_processed already returns true.


This fixes the problem, but it will result in waiting longer than 
necessary because the worker only finishes when the IH ring is empty.
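
A narrower drain could snapshot how far the ring had been written when the
drain starts and wait only until that checkpoint is processed, e.g. (a rough
sketch; both helpers here are hypothetical, not existing amdgpu API):

	u32 checkpoint = ih_snapshot_wptr(&adev->irq.ih1);

	wait_event(adev->irq.ih1.wait_process,
		   ih_has_processed(&adev->irq.ih1, checkpoint));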





restore_pages cannot call mmget because mmput may call mmu notifier
release, causing a deadlock.


See my comment inline.




Refactor the deferred list work to call mmget and take the mmap write lock
to handle all ranges, avoiding the mm going away while inserting the mmu
notifier.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_process.c |  6 +++
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 69 
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 +
  3 files changed, 41 insertions(+), 35 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index d4c8a6948a9f..8b4b045d5c92 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1143,6 +1143,12 @@ static void kfd_process_notifier_release(struct 
mmu_notifier *mn,
if (WARN_ON(p->mm != mm))
return;
  
+	/*

+* Ensure no retry fault comes in afterwards, as page fault handler will
+* not find kfd process and take mm lock to recover fault.
+*/
+   svm_range_drain_retry_fault(&p->svms);
+
mutex_lock(&kfd_processes_mutex);
hash_del_rcu(&p->kfd_processes);
mutex_unlock(&kfd_processes_mutex);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 88360f23eb61..c1f367934428 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1953,9 +1953,10 @@ svm_range_handle_list_op(struct svm_range_list *svms, 
struct svm_range *prange)
}
  }
  
-static void svm_range_drain_retry_fault(struct svm_range_list *svms)

+void svm_range_drain_retry_fault(struct svm_range_list *svms)
  {
struct kfd_process_device *pdd;
+   struct amdgpu_device *adev;
struct kfd_process *p;
uint32_t i;
  
@@ -1967,9 +1968,11 @@ static void svm_range_drain_retry_fault(struct svm_range_list *svms)

continue;
  
  		pr_debug("drain retry fault gpu %d svms %p\n", i, svms);

+   adev = pdd->dev->adev;
+   amdgpu_ih_wait_on_checkpoint_process(adev, &adev->irq.ih1);
  
-		amdgpu_ih_wait_on_checkpoint_process(pdd->dev->adev,

-&pdd->dev->adev->irq.ih1);
+   /* Wait until the last page fault is handled */
+   flush_work(&adev->irq.ih1_work);
pr_debug("drain retry fault gpu %d svms 0x%p done\n", i, svms);
}
  }
@@ -1979,43 +1982,43 @@ static void svm_range_deferred_list_work(struct 
work_struct *work)
struct svm_range_list *svms;
struct svm_range *prange;
struct mm_struct *mm;
+   struct kfd_process *p;
  
  	svms = container_of(work, struct svm_range_list, deferred_list_work);

pr_debug("enter svms 0x%p\n", svms);
  
+	p = container_of(svms, struct kfd_process, svms);

+   mm = p->mm;
+
+   /* Take mm->mm_users to avoid the mm going away when inserting the mmu notifier */
+   if (!mm || !mmget_not_zero(mm)) {


get_task_mm would be safer than relying on p->mm. I regret ever adding 
that to the process structure.




+   pr_debug("svms 0x%p process mm gone\n", svms);
+   return;
+   }
+retry:
+   mmap_write_lock(mm);
+
+   /* Checking for the need to drain retry faults must be inside
+* mmap write lock to serialize with munmap notifiers.
+*/
+   if (unlikely(READ_ONCE(svms->drain_pagefaults))) {
+   WRITE_ONCE(svms->drain_pagefaults, false);
+   mmap_write_unlock(mm);
+   svm_range_drain_retry_fault(svms);
+   goto retry;
+   }
+
spin_lock(&svms->deferred_list_lock);
while (!list_empty(&svms->deferred_range_list)) {
prange = list_first_entry(&svms->deferred_range_list,
  struct svm_range, deferred_list);
+   list_del_init(&prange->deferred_list);
spin_unlock(&svms->deferred_list_lock);
+
pr_debug("prange 0x%p [0x%lx 0x%lx] op %d\n", prange,
 prange->start, prange->last, prange->work_item.op);
  
-		mm = prange->work_item.mm;

-retry:
-   mmap_write_lock(mm);
mutex_lock(&svms-

Re: [PATCH] drm/amdkfd: Retrieve SDMA numbers from amdgpu

2021-11-17 Thread Felix Kuehling

On 2021-11-17 3:36 p.m., Amber Lin wrote:

Instead of hard coding the number of sdma engines and the number of
sdma_xgmi engines in the device_info table, get the total number of SDMA
instances from amdgpu. The first two engines are sdma engines and the
rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.

v2: Move get_num_*_sdma_engines to global and shared by queues manager
 and topology.
v3: Use gmc.xgmi.supported to justify the SDMA PCIe/XGMI assignment


You can remove this version history, since this is the first public review.

One more nit-pick inline.




Signed-off-by: Amber Lin 
Reviewed-by: Felix Kuehling 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c   | 20 
  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 31 +++
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  3 ++
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c |  5 ++-
  4 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index ce9f4e562bac..ec1f6bacb61e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1516,6 +1516,26 @@ void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, 
uint64_t throttle_bitmask)
kfd_smi_event_update_thermal_throttling(kfd, throttle_bitmask);
  }
  
+/* get_num_sdma_engines returns the number of PCIe optimized SDMA and

+ * get_num_xgmi_sdma_engines returns the number of XGMI SDMA.
+ * When the device has more than two engines, we reserve two for PCIe to enable
+ * full-duplex and the rest are used as XGMI.
+ */
+unsigned int get_num_sdma_engines(struct kfd_dev *kdev)


These two non-static functions should have a proper kfd_ name prefix.

Regards,
  Felix
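
The rename is mechanical, e.g. (a sketch):

	unsigned int kfd_get_num_sdma_engines(struct kfd_dev *kdev);
	unsigned int kfd_get_num_xgmi_sdma_engines(struct kfd_dev *kdev);

with the callers in kfd_device_queue_manager.c and kfd_topology.c updated
to match.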



+{
+   /* If XGMI is not supported, all SDMA engines are PCIe */
+   if (!kdev->adev->gmc.xgmi.supported)
+   return kdev->adev->sdma.num_instances;
+
+   return min(kdev->adev->sdma.num_instances, 2);
+}
+
+unsigned int get_num_xgmi_sdma_engines(struct kfd_dev *kdev)
+{
+   /* After reserving engines for PCIe, the rest are XGMI */
+   return kdev->adev->sdma.num_instances - get_num_sdma_engines(kdev);
+}
+
  #if defined(CONFIG_DEBUG_FS)
  
  /* This function will send a package to HIQ to hang the HWS

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 62fe28244a80..5f2886cf4d7e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -99,31 +99,22 @@ unsigned int get_pipes_per_mec(struct device_queue_manager 
*dqm)
return dqm->dev->shared_resources.num_pipe_per_mec;
  }
  
-static unsigned int get_num_sdma_engines(struct device_queue_manager *dqm)

-{
-   return dqm->dev->device_info->num_sdma_engines;
-}
-
-static unsigned int get_num_xgmi_sdma_engines(struct device_queue_manager *dqm)
-{
-   return dqm->dev->device_info->num_xgmi_sdma_engines;
-}
-
  static unsigned int get_num_all_sdma_engines(struct device_queue_manager *dqm)
  {
-   return get_num_sdma_engines(dqm) + get_num_xgmi_sdma_engines(dqm);
+   return get_num_sdma_engines(dqm->dev) +
+   get_num_xgmi_sdma_engines(dqm->dev);
  }
  
  unsigned int get_num_sdma_queues(struct device_queue_manager *dqm)

  {
-   return dqm->dev->device_info->num_sdma_engines
-   * dqm->dev->device_info->num_sdma_queues_per_engine;
+   return get_num_sdma_engines(dqm->dev) *
+   dqm->dev->device_info->num_sdma_queues_per_engine;
  }
  
  unsigned int get_num_xgmi_sdma_queues(struct device_queue_manager *dqm)

  {
-   return dqm->dev->device_info->num_xgmi_sdma_engines
-   * dqm->dev->device_info->num_sdma_queues_per_engine;
+   return get_num_xgmi_sdma_engines(dqm->dev) *
+   dqm->dev->device_info->num_sdma_queues_per_engine;
  }
  
  void program_sh_mem_settings(struct device_queue_manager *dqm,

@@ -1054,9 +1045,9 @@ static int allocate_sdma_queue(struct 
device_queue_manager *dqm,
dqm->sdma_bitmap &= ~(1ULL << bit);
q->sdma_id = bit;
q->properties.sdma_engine_id = q->sdma_id %
-   get_num_sdma_engines(dqm);
+   get_num_sdma_engines(dqm->dev);
q->properties.sdma_queue_id = q->sdma_id /
-   get_num_sdma_engines(dqm);
+   get_num_sdma_engines(dqm->dev);
} else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI) {
if (dqm->xgmi_sdma_bitmap == 0) {
pr_err("No more XGMI SDMA queue to allocate\n");
@@ -1071,10 +1062,10 @@ static int allocate_sdma_queue(struct 
device_queue_manager *dqm,
 * assumes the first N engines are always
 * PCIe-optimized ones
 

Re: [PATCH 2/2] drm/amd/pm: Print the error on command submission

2021-11-17 Thread Alex Deucher
Acked-by: Alex Deucher 

On Wed, Nov 17, 2021 at 1:56 PM Luben Tuikov  wrote:
>
> Print the error on command submission immediately after submitting to
> the SMU. This is rate-limited. It helps to immediately know there was an
> error on command submission, rather than leave it up to clients to report
> the error, as sometimes they do not.
>
> Cc: Alex Deucher 
> Signed-off-by: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index f9a42a07eeaebf..048ca16738638f 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -352,7 +352,7 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context 
> *smu,
> __smu_cmn_send_msg(smu, (uint16_t) index, param);
> reg = __smu_cmn_poll_stat(smu);
> res = __smu_cmn_reg2errno(smu, reg);
> -   if (res == -EREMOTEIO)
> +   if (res != 0)
> __smu_cmn_reg_print_error(smu, reg, index, param, msg);
> if (read_arg)
> smu_cmn_read_arg(smu, read_arg);
> --
> 2.34.0
>


Re: [PATCH 1/2] drm/amd/pm: Add debug prints

2021-11-17 Thread Alex Deucher
On Wed, Nov 17, 2021 at 1:56 PM Luben Tuikov  wrote:
>
> Add prints where there are none and none are printed in the callee.
>
> Add a print in sienna_cichlid_run_btc() to help debug and to mirror other
> platforms, as no print is present in the caller, smu_smc_hw_setup().
>
> Remove the word "previous" from comment and print to make it shorter and
> avoid confusion in various prints.
>
> Cc: Alex Deucher 
> Signed-off-by: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c   | 8 +---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 8 +++-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 4 ++--
>  3 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 01168b8955bff3..67cc6fb4f422f4 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -1153,6 +1153,8 @@ static int smu_smc_hw_setup(struct smu_context *smu)
> case IP_VERSION(11, 5, 0):
> case IP_VERSION(11, 0, 12):
> ret = smu_system_features_control(smu, true);
> +   if (ret)
> +   dev_err(adev->dev, "Failed system features 
> control!\n");
> break;
> default:
> break;
> @@ -1277,8 +1279,10 @@ static int smu_smc_hw_setup(struct smu_context *smu)
> }
>
> ret = smu_notify_display_change(smu);
> -   if (ret)
> +   if (ret) {
> +   dev_err(adev->dev, "Failed to notify display change!\n");
> return ret;
> +   }
>
> /*
>  * Set min deep sleep dce fclk with bootup value from vbios via
> @@ -1286,8 +1290,6 @@ static int smu_smc_hw_setup(struct smu_context *smu)
>  */
> ret = smu_set_min_dcef_deep_sleep(smu,
>   smu->smu_table.boot_values.dcefclk 
> / 100);
> -   if (ret)
> -   return ret;
>
> return ret;
>  }
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> index b0bb389185d51c..f3522320df7e58 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> @@ -2135,7 +2135,13 @@ static int sienna_cichlid_od_edit_dpm_table(struct 
> smu_context *smu,
>
>  static int sienna_cichlid_run_btc(struct smu_context *smu)
>  {
> -   return smu_cmn_send_smc_msg(smu, SMU_MSG_RunDcBtc, NULL);
> +   int res;
> +
> +   res = smu_cmn_send_smc_msg(smu, SMU_MSG_RunDcBtc, NULL);
> +   if (res)
> +   dev_err(smu->adev->dev, "RunDcBtc failed!\n");
> +
> +   return res;

Maybe better to split this hunk into a separate patch and also fix up
the run_btc functions for other asics.

Alex
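
For illustration, the same pattern applied to another ASIC's run_btc would
look like this (hypothetical function name; assumes the ASIC sends the same
SMU_MSG_RunDcBtc message):

	static int other_asic_run_btc(struct smu_context *smu)
	{
		int res;

		res = smu_cmn_send_smc_msg(smu, SMU_MSG_RunDcBtc, NULL);
		if (res)
			dev_err(smu->adev->dev, "RunDcBtc failed!\n");

		return res;
	}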


>  }
>
>  static int sienna_cichlid_baco_enter(struct smu_context *smu)
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index ea6f50c08c5f3b..f9a42a07eeaebf 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -97,7 +97,7 @@ static void smu_cmn_read_arg(struct smu_context *smu,
>   * smu: a pointer to SMU context
>   *
>   * Returns the status of the SMU, which could be,
> - *0, the SMU is busy with your previous command;
> + *0, the SMU is busy with your command;
>   *1, execution status: success, execution result: success;
>   * 0xFF, execution status: success, execution result: failure;
>   * 0xFE, unknown command;
> @@ -143,7 +143,7 @@ static void __smu_cmn_reg_print_error(struct smu_context 
> *smu,
> u32 msg_idx = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_66);
> u32 prm = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82);
> dev_err_ratelimited(adev->dev,
> -   "SMU: I'm not done with your previous 
> command: SMN_C2PMSG_66:0x%08X SMN_C2PMSG_82:0x%08X",
> +   "SMU: I'm not done with your command: 
> SMN_C2PMSG_66:0x%08X SMN_C2PMSG_82:0x%08X",
> msg_idx, prm);
> }
> break;
>
> base-commit: ae2faedcc13fa5ee109ceb9e8cc05d759ad57980
> --
> 2.34.0
>


Re: [PATCH -next] drm/amd/display: check top_pipe_to_program pointer

2021-11-17 Thread Alex Deucher
Applied.  Thanks!

On Mon, Nov 15, 2021 at 3:10 AM Yang Li  wrote:
>
> Clang static analysis reports this error
>
> drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:2870:7: warning:
> Dereference of null pointer [clang-analyzer-core.NullDereference]
> if
> (top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
> ^
>
> top_pipe_to_program being NULL is caught as an error,
> but then it is used to report the error.
>
> So add a check before using it.
>
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 
> ---
>  drivers/gpu/drm/amd/display/dc/core/dc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
> b/drivers/gpu/drm/amd/display/dc/core/dc.c
> index 39ad385..34382d0 100644
> --- a/drivers/gpu/drm/amd/display/dc/core/dc.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
> @@ -2867,7 +2867,8 @@ static void commit_planes_for_stream(struct dc *dc,
>  #endif
>
> if ((update_type != UPDATE_TYPE_FAST) && 
> stream->update_flags.bits.dsc_changed)
> -   if 
> (top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
> +   if (top_pipe_to_program &&
> +   
> top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
> if (should_use_dmub_lock(stream->link)) {
> union dmub_hw_lock_flags hw_locks = { 0 };
> struct dmub_hw_lock_inst_flags inst_flags = { 
> 0 };
> --
> 1.8.3.1
>


Re: [PATCH] drm/amd/display: remove no need NULL check before kfree

2021-11-17 Thread Alex Deucher
Applied.  Thanks!

On Mon, Nov 15, 2021 at 8:48 PM Bernard Zhao  wrote:
>
> This change is to clean up the code a bit.
>
> Signed-off-by: Bernard Zhao 
> ---
>  .../drm/amd/display/dc/dcn10/dcn10_resource.c  | 18 ++
>  1 file changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c 
> b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
> index f37551e00023..0090550d4aee 100644
> --- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
> +++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
> @@ -978,10 +978,8 @@ static void dcn10_resource_destruct(struct 
> dcn10_resource_pool *pool)
> pool->base.mpc = NULL;
> }
>
> -   if (pool->base.hubbub != NULL) {
> -   kfree(pool->base.hubbub);
> -   pool->base.hubbub = NULL;
> -   }
> +   kfree(pool->base.hubbub);
> +   pool->base.hubbub = NULL;
>
> for (i = 0; i < pool->base.pipe_count; i++) {
> if (pool->base.opps[i] != NULL)
> @@ -1011,14 +1009,10 @@ static void dcn10_resource_destruct(struct 
> dcn10_resource_pool *pool)
> for (i = 0; i < pool->base.res_cap->num_ddc; i++) {
> if (pool->base.engines[i] != NULL)
> dce110_engine_destroy(&pool->base.engines[i]);
> -   if (pool->base.hw_i2cs[i] != NULL) {
> -   kfree(pool->base.hw_i2cs[i]);
> -   pool->base.hw_i2cs[i] = NULL;
> -   }
> -   if (pool->base.sw_i2cs[i] != NULL) {
> -   kfree(pool->base.sw_i2cs[i]);
> -   pool->base.sw_i2cs[i] = NULL;
> -   }
> +   kfree(pool->base.hw_i2cs[i]);
> +   pool->base.hw_i2cs[i] = NULL;
> +   kfree(pool->base.sw_i2cs[i]);
> +   pool->base.sw_i2cs[i] = NULL;
> }
>
> for (i = 0; i < pool->base.audio_count; i++) {
> --
> 2.33.1
>


Re: [PATCH] drm/amd/display: cleanup the code a bit

2021-11-17 Thread Alex Deucher
Applied.  Thanks!

On Tue, Nov 16, 2021 at 4:19 AM Christian König
 wrote:
>
> On 16.11.21 at 02:34, Bernard Zhao wrote:
> > In dc_sink_destruct, kfree already checks the pointer, so there is no
> > need to check it again.
> > This change is to clean up the code a bit.
> >
> > Signed-off-by: Bernard Zhao 
>
> This one and the other patch are Acked-by: Christian König
> 
>
> > ---
> >   drivers/gpu/drm/amd/display/dc/core/dc_sink.c | 10 +-
> >   1 file changed, 1 insertion(+), 9 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_sink.c 
> > b/drivers/gpu/drm/amd/display/dc/core/dc_sink.c
> > index a249a0e5edd0..4b5e4d8e7735 100644
> > --- a/drivers/gpu/drm/amd/display/dc/core/dc_sink.c
> > +++ b/drivers/gpu/drm/amd/display/dc/core/dc_sink.c
> > @@ -33,14 +33,6 @@
> >* Private functions
> >
> > **/
> >
> > -static void dc_sink_destruct(struct dc_sink *sink)
> > -{
> > - if (sink->dc_container_id) {
> > - kfree(sink->dc_container_id);
> > - sink->dc_container_id = NULL;
> > - }
> > -}
> > -
> >   static bool dc_sink_construct(struct dc_sink *sink, const struct 
> > dc_sink_init_data *init_params)
> >   {
> >
> > @@ -75,7 +67,7 @@ void dc_sink_retain(struct dc_sink *sink)
> >   static void dc_sink_free(struct kref *kref)
> >   {
> >   struct dc_sink *sink = container_of(kref, struct dc_sink, refcount);
> > - dc_sink_destruct(sink);
> > + kfree(sink->dc_container_id);
> >   kfree(sink);
> >   }
> >
>


Re: [PATCH] drm/amd/amdgpu: fix potential memleak

2021-11-17 Thread Alex Deucher
Applied.  Thanks!

Alex

On Mon, Nov 15, 2021 at 10:56 AM Felix Kuehling  wrote:
>
> On 2021-11-14 at 9:58 p.m., Bernard Zhao wrote:
> > In function amdgpu_get_xgmi_hive, when kobject_init_and_add failed
> > There is a potential memleak if not call kobject_put.
> >
> > Signed-off-by: Bernard Zhao 
>
> Reviewed-by: Felix Kuehling 
>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> > index 0fad2bf854ae..567df2db23ac 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> > @@ -386,6 +386,7 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
> > amdgpu_device *adev)
> >   "%s", "xgmi_hive_info");
> >   if (ret) {
> >   dev_err(adev->dev, "XGMI: failed initializing kobject for 
> > xgmi hive\n");
> > + kobject_put(&hive->kobj);
> >   kfree(hive);
> >   hive = NULL;
> >   goto pro_end;


Re: [PATCH v2] drm/amd/amdgpu: cleanup the code style a bit

2021-11-17 Thread Alex Deucher
Applied.  Thanks!

Alex

On Mon, Nov 15, 2021 at 7:09 AM Bernard Zhao  wrote:
>
> This change is to clean up the code style a bit.
>
> Signed-off-by: Bernard Zhao 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 04cf9b207e62..3fc49823f527 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -283,17 +283,15 @@ static int amdgpu_virt_init_ras_err_handler_data(struct 
> amdgpu_device *adev)
>
> *data = kmalloc(sizeof(struct amdgpu_virt_ras_err_handler_data), 
> GFP_KERNEL);
> if (!*data)
> -   return -ENOMEM;
> +   goto data_failure;
>
> bps = kmalloc_array(align_space, sizeof((*data)->bps), GFP_KERNEL);
> -   bps_bo = kmalloc_array(align_space, sizeof((*data)->bps_bo), 
> GFP_KERNEL);
> +   if (!bps)
> +   goto bps_failure;
>
> -   if (!bps || !bps_bo) {
> -   kfree(bps);
> -   kfree(bps_bo);
> -   kfree(*data);
> -   return -ENOMEM;
> -   }
> +   bps_bo = kmalloc_array(align_space, sizeof((*data)->bps_bo), 
> GFP_KERNEL);
> +   if (!bps_bo)
> +   goto bps_bo_failure;
>
> (*data)->bps = bps;
> (*data)->bps_bo = bps_bo;
> @@ -303,6 +301,13 @@ static int amdgpu_virt_init_ras_err_handler_data(struct 
> amdgpu_device *adev)
> virt->ras_init_done = true;
>
> return 0;
> +
> +bps_bo_failure:
> +   kfree(bps);
> +bps_failure:
> +   kfree(*data);
> +data_failure:
> +   return -ENOMEM;
>  }
>
>  static void amdgpu_virt_ras_release_bp(struct amdgpu_device *adev)
> --
> 2.33.1
>


Re: [PATCH] drm/amd/amdgpu: remove useless break after return

2021-11-17 Thread Alex Deucher
Applied, thanks.  If you want to make the numbering more sequential,
please also update the other dce files when you make that change.

Alex

On Mon, Nov 15, 2021 at 2:14 AM Bernard Zhao  wrote:
>
> This change is to remove useless break after return.
>
> Signed-off-by: Bernard Zhao 
> ---
>  drivers/gpu/drm/amd/amdgpu/dce_v8_0.c | 4 
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c 
> b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
> index b200b9e722d9..8318ee8339f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
> @@ -2092,22 +2092,18 @@ static int dce_v8_0_pick_dig_encoder(struct 
> drm_encoder *encoder)
> return 1;
> else
> return 0;
> -   break;
> case ENCODER_OBJECT_ID_INTERNAL_UNIPHY1:
> if (dig->linkb)
> return 3;
> else
> return 2;
> -   break;
> case ENCODER_OBJECT_ID_INTERNAL_UNIPHY2:
> if (dig->linkb)
> return 5;
> else
> return 4;
> -   break;
> case ENCODER_OBJECT_ID_INTERNAL_UNIPHY3:
> return 6;
> -   break;
> default:
> DRM_ERROR("invalid encoder_id: 0x%x\n", 
> amdgpu_encoder->encoder_id);
> return 0;
> --
> 2.33.1
>


Re: [PATCH v3 1/6] drm: move the buddy allocator from i915 into common drm

2021-11-17 Thread kernel test robot
Hi Arunpravin,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm/drm-next]
[also build test ERROR on drm-intel/for-linux-next v5.16-rc1]
[cannot apply to drm-tip/drm-tip next-2027]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Arunpravin/drm-move-the-buddy-allocator-from-i915-into-common-drm/2027-041920
base:   git://anongit.freedesktop.org/drm/drm drm-next
config: i386-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# 
https://github.com/0day-ci/linux/commit/47fb1aeae75661971f4526efddf4ae5b5738977f
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Arunpravin/drm-move-the-buddy-allocator-from-i915-into-common-drm/2027-041920
git checkout 47fb1aeae75661971f4526efddf4ae5b5738977f
# save the attached .config to linux build tree
mkdir build_dir
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/intel_memory_region.c:242:
>> drivers/gpu/drm/i915/selftests/intel_memory_region.c:23:10: fatal error: 
>> i915_buddy.h: No such file or directory
  23 | #include "i915_buddy.h"
 |  ^~
   compilation terminated.
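
The likely fix (an assumption based on this series moving the allocator's
header into include/drm/) is to update the selftest include:

--- a/drivers/gpu/drm/i915/selftests/intel_memory_region.c
+++ b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
-#include "i915_buddy.h"
+#include <drm/drm_buddy.h>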


vim +23 drivers/gpu/drm/i915/selftests/intel_memory_region.c

232a6ebae419193 Matthew Auld 2019-10-08  14  
340be48f2c5a3c0 Matthew Auld 2019-10-25  15  #include 
"gem/i915_gem_context.h"
b908be543e44414 Matthew Auld 2019-10-25  16  #include "gem/i915_gem_lmem.h"
232a6ebae419193 Matthew Auld 2019-10-08  17  #include 
"gem/i915_gem_region.h"
340be48f2c5a3c0 Matthew Auld 2019-10-25  18  #include 
"gem/selftests/igt_gem_utils.h"
232a6ebae419193 Matthew Auld 2019-10-08  19  #include 
"gem/selftests/mock_context.h"
99919be74aa3753 Thomas Hellström 2021-06-17  20  #include "gt/intel_engine_pm.h"
6804da20bb549e3 Chris Wilson 2019-10-27  21  #include 
"gt/intel_engine_user.h"
b908be543e44414 Matthew Auld 2019-10-25  22  #include "gt/intel_gt.h"
d53ec322dc7de32 Matthew Auld 2021-06-16 @23  #include "i915_buddy.h"
99919be74aa3753 Thomas Hellström 2021-06-17  24  #include "gt/intel_migrate.h"
ba12993c5228015 Matthew Auld 2020-01-29  25  #include "i915_memcpy.h"
d53ec322dc7de32 Matthew Auld 2021-06-16  26  #include 
"i915_ttm_buddy_manager.h"
01377a0d7e6648b Abdiel Janulgue  2019-10-25  27  #include 
"selftests/igt_flush_test.h"
2f0b97ca0211863 Matthew Auld 2019-10-08  28  #include 
"selftests/i915_random.h"
232a6ebae419193 Matthew Auld 2019-10-08  29  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org




[PATCH] drm/amdkfd: Retrieve SDMA numbers from amdgpu

2021-11-17 Thread Amber Lin
Instead of hard coding the number of sdma engines and the number of
sdma_xgmi engines in the device_info table, get the total number of SDMA
instances from amdgpu. The first two engines are sdma engines and the
rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.

v2: Move get_num_*_sdma_engines to global and shared by queues manager
and topology.
v3: Use gmc.xgmi.supported to justify the SDMA PCIe/XGMI assignment

Signed-off-by: Amber Lin 
Reviewed-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   | 20 
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 31 +++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  3 ++
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c |  5 ++-
 4 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index ce9f4e562bac..ec1f6bacb61e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1516,6 +1516,26 @@ void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, 
uint64_t throttle_bitmask)
kfd_smi_event_update_thermal_throttling(kfd, throttle_bitmask);
 }
 
+/* get_num_sdma_engines returns the number of PCIe optimized SDMA and
+ * get_num_xgmi_sdma_engines returns the number of XGMI SDMA.
+ * When the device has more than two engines, we reserve two for PCIe to enable
+ * full-duplex and the rest are used as XGMI.
+ */
+unsigned int get_num_sdma_engines(struct kfd_dev *kdev)
+{
+   /* If XGMI is not supported, all SDMA engines are PCIe */
+   if (!kdev->adev->gmc.xgmi.supported)
+   return kdev->adev->sdma.num_instances;
+
+   return min(kdev->adev->sdma.num_instances, 2);
+}
+
+unsigned int get_num_xgmi_sdma_engines(struct kfd_dev *kdev)
+{
+   /* After reserving engines for PCIe, the rest are XGMI */
+   return kdev->adev->sdma.num_instances - get_num_sdma_engines(kdev);
+}
+
 #if defined(CONFIG_DEBUG_FS)
 
 /* This function will send a package to HIQ to hang the HWS
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 62fe28244a80..5f2886cf4d7e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -99,31 +99,22 @@ unsigned int get_pipes_per_mec(struct device_queue_manager 
*dqm)
return dqm->dev->shared_resources.num_pipe_per_mec;
 }
 
-static unsigned int get_num_sdma_engines(struct device_queue_manager *dqm)
-{
-   return dqm->dev->device_info->num_sdma_engines;
-}
-
-static unsigned int get_num_xgmi_sdma_engines(struct device_queue_manager *dqm)
-{
-   return dqm->dev->device_info->num_xgmi_sdma_engines;
-}
-
 static unsigned int get_num_all_sdma_engines(struct device_queue_manager *dqm)
 {
-   return get_num_sdma_engines(dqm) + get_num_xgmi_sdma_engines(dqm);
+   return get_num_sdma_engines(dqm->dev) +
+   get_num_xgmi_sdma_engines(dqm->dev);
 }
 
 unsigned int get_num_sdma_queues(struct device_queue_manager *dqm)
 {
-   return dqm->dev->device_info->num_sdma_engines
-   * dqm->dev->device_info->num_sdma_queues_per_engine;
+   return get_num_sdma_engines(dqm->dev) *
+   dqm->dev->device_info->num_sdma_queues_per_engine;
 }
 
 unsigned int get_num_xgmi_sdma_queues(struct device_queue_manager *dqm)
 {
-   return dqm->dev->device_info->num_xgmi_sdma_engines
-   * dqm->dev->device_info->num_sdma_queues_per_engine;
+   return get_num_xgmi_sdma_engines(dqm->dev) *
+   dqm->dev->device_info->num_sdma_queues_per_engine;
 }
 
 void program_sh_mem_settings(struct device_queue_manager *dqm,
@@ -1054,9 +1045,9 @@ static int allocate_sdma_queue(struct 
device_queue_manager *dqm,
dqm->sdma_bitmap &= ~(1ULL << bit);
q->sdma_id = bit;
q->properties.sdma_engine_id = q->sdma_id %
-   get_num_sdma_engines(dqm);
+   get_num_sdma_engines(dqm->dev);
q->properties.sdma_queue_id = q->sdma_id /
-   get_num_sdma_engines(dqm);
+   get_num_sdma_engines(dqm->dev);
} else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI) {
if (dqm->xgmi_sdma_bitmap == 0) {
pr_err("No more XGMI SDMA queue to allocate\n");
@@ -1071,10 +1062,10 @@ static int allocate_sdma_queue(struct 
device_queue_manager *dqm,
 * assumes the first N engines are always
 * PCIe-optimized ones
 */
-   q->properties.sdma_engine_id = get_num_sdma_engines(dqm) +
-   q->sdma_id % get_num_xgmi_sdma_engines(dqm);
+   q->properties.sdma_engine_id = get_num_sdma_engines(dqm->dev) +
+   q->sdma_id % get_n

Re: [PATCH 1/1] amdgpu/pm: restructure reporting of clock values by smu

2021-11-17 Thread Alex Deucher
On Wed, Nov 17, 2021 at 2:12 AM Lazar, Lijo  wrote:
>
>
>
> On 11/17/2021 11:50 AM, Darren Powell wrote:
> >   Use of sysfs_emit by each of the specific device implementations is 
> > problematic.
> >   To remove this back to a higher level, this patch adds a new function 
> > "get_clock_levels"
> >   to the power-management API (amd_pm_funcs) and smu powerplay API 
> > (pptable_funcs). The
> >   function returns an array of values to the power management module, which 
> > can then
> >   interpret the array and print the values at the top layer.
> >   This patch is only implemented for navi10 for some clocks to seek comment 
> > on the
> >   implementation before extending the implementation to other clock values 
> > and devices.
> >
> >   == Changes ==
> >- new power management function get_clock_levels implemented by 
> > smu_get_ppclk_levels()
> >- new power management function amdgpu_dpm_print_clock_array to print 
> > most common clock
> > list, which can be later extended to use different units (currently 
> > only MHz), or
> > other functions added to handle more complex structure (eg pcie 
> > clocks)
> >- new powerplay function get_clk_levels implemented by 
> > navi10_get_clk_levels()
> >- new helper function smu_convert_to_smuclk called by 
> > smu_print_ppclk_levels and
> > smu_get_ppclk_levels
> >- new error return value -EOPNOTSUPP for clock levels not recognized, 
> > allowing fallback
> > to print_clock_levels. (NOTE: If pm or dpm not enabled current 
> > implementation will
> > try and fail with both methods)
> >- new error return value -EINVAL for unsupported clock level
> >- new error messages output for error conditions encountered at the 
> > device specific layer
> > which can be removed if suggested
> >
> >   == Test ==
> >   LOGFILE=sysfs_emit.test.log
> >   cd ~/workspace/linux
> >   cp ${LOGFILE}{,.old}
> >
> >   AMDGPU_PCI_ADDR=`lspci -nn | grep "VGA\|Display" | cut -d " " -f 1`
> >   AMDGPU_HWMON=`ls -la /sys/class/hwmon | grep $AMDGPU_PCI_ADDR | awk 
> > '{print $9}'`
> >   HWMON_DIR=/sys/class/hwmon/${AMDGPU_HWMON}
> >
> >   lspci -nn | grep "VGA\|Display"  > $LOGFILE
> >   FILES="pp_dpm_sclk
> >   pp_dpm_pcie
> >   "
> >
>
> There hasn't been any problem with sysfs_emit for these nodes.
>
> The problem is with this kind of usage -
>
>  size = amdgpu_dpm_print_clock_levels(adev, OD_SCLK, buf);
>  size += amdgpu_dpm_print_clock_levels(adev, OD_MCLK,
> buf+size);
>  size += amdgpu_dpm_print_clock_levels(adev,
> OD_VDDC_CURVE, buf+size);
>  size += amdgpu_dpm_print_clock_levels(adev,
> OD_VDDGFX_OFFSET, buf+size);
>  size += amdgpu_dpm_print_clock_levels(adev, OD_RANGE,
> buf+size);
>  size += amdgpu_dpm_print_clock_levels(adev, OD_CCLK,
> buf+size);
>
>
> It's mixing a set of data in a particular node. Then the node needs to
> pass buf and offset separately to use sysfs_emit_at
>
> Even then, if you notice, not every data is a clock level data. Some
> like VDDC_CURVE is ASIC specific and have a different format than clock
> DPM level data.
>
> The max number of DPM levels is ASIC-specific. In the below
> implementation a max of 8 levels is assumed for all ASICs, but there are
> ASICs which support 16 levels for a particular clock. In fact, the
> implementation needs to pass max+1 as an input arg for list_length.
>
> Sometimes the current frequency may not match an exact DPM level; the
> approximation of what the current level is is ASIC-specific. So instead
> of the current freq, the ASIC implementation needs to pass the levels and
> the current level.
>
> In general, the problem with sysfs_emit_at is not related to a common
> print format. pp_od_clk_voltage node is a representation of the problem
> - a node which is an aggregator of different data formats. Also, it is
> difficult to narrow down to a single one across ASICs given that there
> could be changes from one to the next.

Perhaps we should just switch the swsmu and powerplay code back to
sprintf for these functions. sysfs_emit doesn't really buy us anything
at this point.

Alex
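
For context, the running-offset pattern that sysfs_emit_at() forces on an
aggregator node looks roughly like this (a sketch, not the actual
pp_od_clk_voltage code):

static ssize_t show_od_clk(struct device *dev,
			   struct device_attribute *attr, char *buf)
{
	int size = 0;

	/* Every section must append through sysfs_emit_at(buf, size, ...) so
	 * the running offset stays inside the sysfs page; sprintf(buf + size)
	 * bypasses that bounds and alignment checking entirely.
	 */
	size += sysfs_emit_at(buf, size, "OD_SCLK:\n");
	size += sysfs_emit_at(buf, size, "0: %uMhz\n", 500);
	size += sysfs_emit_at(buf, size, "1: %uMhz\n", 2500);

	return size;
}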

>
> Thanks,
> Lijo
>
>
>
> >   for f in $FILES
> >   do
> > echo === $f === >> $LOGFILE
> > cat $HWMON_DIR/device/$f >> $LOGFILE
> >   done
> >   cat $LOGFILE
> >
> > Signed-off-by: Darren Powell 
> > ---
> >   .../gpu/drm/amd/include/kgd_pp_interface.h|  2 +
> >   drivers/gpu/drm/amd/pm/amdgpu_pm.c| 41 +++-
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  9 ++
> >   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 53 +-
> >   .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 96 +++
> >   5 files changed, 192 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
> > b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> > index bac15c466733..fbe045811e60 100644
> > --- a/drivers/gpu/drm/amd/include/

Re: [PATCH v3 4/6] drm: implement a method to free unused pages

2021-11-17 Thread Matthew Auld

On 16/11/2021 20:18, Arunpravin wrote:

On contiguous allocation, we round up the size
to the *next* power of 2; implement a function
to free the unused pages after the newly allocated block.

v2(Matthew Auld):
   - replace function name 'drm_buddy_free_unused_pages' with
 drm_buddy_block_trim
   - replace input argument name 'actual_size' with 'new_size'
   - add more validation checks for input arguments
   - add overlaps check to avoid needless searching and splitting
   - merged the below patch to see the feature in action
 - add free unused pages support to i915 driver
   - lock drm_buddy_block_trim() function as it calls mark_free/mark_split
 are all globally visible
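
For context, the intended usage pairs the allocation and the trim roughly
like this (a sketch; the drm_buddy_alloc argument list is assumed from
patch 2/6 of this series):

	LIST_HEAD(blocks);

	/* contiguous request: round up to a power of two, then trim back */
	err = drm_buddy_alloc(mm, start, end, roundup_pow_of_two(size),
			      min_page_size, &blocks, flags);
	if (!err)
		err = drm_buddy_block_trim(mm, size, &blocks);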

Signed-off-by: Arunpravin 
---
  drivers/gpu/drm/drm_buddy.c   | 108 ++
  drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  10 ++
  include/drm/drm_buddy.h   |   4 +
  3 files changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 0a9db2981188..943fe2ad27bf 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -284,6 +284,114 @@ static inline bool contains(u64 s1, u64 e1, u64 s2, u64 
e2)
return s1 <= s2 && e1 >= e2;
  }
  
+/**

+ * drm_buddy_block_trim - free unused pages
+ *
+ * @mm: DRM buddy manager
+ * @new_size: original size requested
+ * @blocks: output list head to add allocated blocks
+ *
+ * For contiguous allocation, we round up the size to the nearest
+ * power of two value; drivers consume the *actual* size, so the remaining
+ * portions are unused and can be freed.
+ *
+ * Returns:
+ * 0 on success, error code on failure.
+ */
+int drm_buddy_block_trim(struct drm_buddy_mm *mm,
+u64 new_size,
+struct list_head *blocks)
+{
+   struct drm_buddy_block *block;
+   struct drm_buddy_block *buddy;
+   u64 new_start;
+   u64 new_end;
+   LIST_HEAD(dfs);
+   u64 count = 0;
+   int err;
+
+   if (!list_is_singular(blocks))
+   return -EINVAL;
+
+   block = list_first_entry(blocks,
+struct drm_buddy_block,
+link);
+
+   if (!drm_buddy_block_is_allocated(block))
+   return -EINVAL;
+
+   if (new_size > drm_buddy_block_size(mm, block))
+   return -EINVAL;
+
+   if (!new_size && !IS_ALIGNED(new_size, mm->chunk_size))
+   return -EINVAL;
+
+   if (new_size == drm_buddy_block_size(mm, block))
+   return 0;
+
+   list_del(&block->link);
+
+   new_start = drm_buddy_block_offset(block);
+   new_end = new_start + new_size - 1;
+
+   mark_free(mm, block);
+
+   list_add(&block->tmp_link, &dfs);
+
+   do {
+   u64 block_start;
+   u64 block_end;
+
+   block = list_first_entry_or_null(&dfs,
+struct drm_buddy_block,
+tmp_link);
+   if (!block)
+   break;
+
+   list_del(&block->tmp_link);
+
+   if (count == new_size)
+   return 0;
+
+   block_start = drm_buddy_block_offset(block);
+   block_end = block_start + drm_buddy_block_size(mm, block) - 1;
+
+   if (!overlaps(new_start, new_end, block_start, block_end))
+   continue;
+
+   if (contains(new_start, new_end, block_start, block_end)) {
+   BUG_ON(!drm_buddy_block_is_free(block));
+
+   /* Allocate only required blocks */
+   mark_allocated(block);
+   mm->avail -= drm_buddy_block_size(mm, block);
+   list_add_tail(&block->link, blocks);
+   count += drm_buddy_block_size(mm, block);
+   continue;
+   }
+
+   if (!drm_buddy_block_is_split(block)) {


Should always be true, right? But I guess depends if we want to re-use 
this for generic range allocation...



+   err = split_block(mm, block);
+   if (unlikely(err))
+   goto err_undo;
+   }
+
+   list_add(&block->right->tmp_link, &dfs);
+   list_add(&block->left->tmp_link, &dfs);
+   } while (1);
+
+   return -ENOSPC;
+
+err_undo:
+   buddy = get_buddy(block);
+   if (buddy &&
+   (drm_buddy_block_is_free(block) &&
+drm_buddy_block_is_free(buddy)))
+   __drm_buddy_free(mm, block);
+   return err;


Looking at the split_block failure path. The user allocated some block, 
and then tried to trim it, but this then marks it as free and removes it 
from their original list (potentially also freeing it), if we fail here. 
Would it be better to leave that decision to the user? If the trim(

[PATCH 2/2] drm/amd/pm: Print the error on command submission

2021-11-17 Thread Luben Tuikov
Print the error on command submission immediately after submitting to
the SMU. This is rate-limited. It helps to immediately know there was an
error on command submission, rather than leave it up to clients to report
the error, as sometimes they do not.

Cc: Alex Deucher 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index f9a42a07eeaebf..048ca16738638f 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -352,7 +352,7 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (res == -EREMOTEIO)
+   if (res != 0)
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
-- 
2.34.0



[PATCH 1/2] drm/amd/pm: Add debug prints

2021-11-17 Thread Luben Tuikov
Add prints where there are none and none are printed in the callee.

Add a print in sienna_cichlid_run_btc() to help debug and to mirror other
platforms, as no print is present in the caller, smu_smc_hw_setup().

Remove the word "previous" from comment and print to make it shorter and
avoid confusion in various prints.

Cc: Alex Deucher 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c   | 8 +---
 drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 8 +++-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 4 ++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 01168b8955bff3..67cc6fb4f422f4 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1153,6 +1153,8 @@ static int smu_smc_hw_setup(struct smu_context *smu)
case IP_VERSION(11, 5, 0):
case IP_VERSION(11, 0, 12):
ret = smu_system_features_control(smu, true);
+   if (ret)
+   dev_err(adev->dev, "Failed system features 
control!\n");
break;
default:
break;
@@ -1277,8 +1279,10 @@ static int smu_smc_hw_setup(struct smu_context *smu)
}
 
ret = smu_notify_display_change(smu);
-   if (ret)
+   if (ret) {
+   dev_err(adev->dev, "Failed to notify display change!\n");
return ret;
+   }
 
/*
 * Set min deep sleep dce fclk with bootup value from vbios via
@@ -1286,8 +1290,6 @@ static int smu_smc_hw_setup(struct smu_context *smu)
 */
ret = smu_set_min_dcef_deep_sleep(smu,
	  smu->smu_table.boot_values.dcefclk / 100);
-   if (ret)
-   return ret;
 
return ret;
 }
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index b0bb389185d51c..f3522320df7e58 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -2135,7 +2135,13 @@ static int sienna_cichlid_od_edit_dpm_table(struct smu_context *smu,
 
 static int sienna_cichlid_run_btc(struct smu_context *smu)
 {
-   return smu_cmn_send_smc_msg(smu, SMU_MSG_RunDcBtc, NULL);
+   int res;
+
+   res = smu_cmn_send_smc_msg(smu, SMU_MSG_RunDcBtc, NULL);
+   if (res)
+   dev_err(smu->adev->dev, "RunDcBtc failed!\n");
+
+   return res;
 }
 
 static int sienna_cichlid_baco_enter(struct smu_context *smu)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index ea6f50c08c5f3b..f9a42a07eeaebf 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -97,7 +97,7 @@ static void smu_cmn_read_arg(struct smu_context *smu,
  * smu: a pointer to SMU context
  *
  * Returns the status of the SMU, which could be,
- *0, the SMU is busy with your previous command;
+ *0, the SMU is busy with your command;
  *1, execution status: success, execution result: success;
  * 0xFF, execution status: success, execution result: failure;
  * 0xFE, unknown command;
@@ -143,7 +143,7 @@ static void __smu_cmn_reg_print_error(struct smu_context *smu,
u32 msg_idx = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_66);
u32 prm = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82);
dev_err_ratelimited(adev->dev,
-   "SMU: I'm not done with your previous 
command: SMN_C2PMSG_66:0x%08X SMN_C2PMSG_82:0x%08X",
+   "SMU: I'm not done with your command: 
SMN_C2PMSG_66:0x%08X SMN_C2PMSG_82:0x%08X",
msg_idx, prm);
}
break;

base-commit: ae2faedcc13fa5ee109ceb9e8cc05d759ad57980
-- 
2.34.0
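
A side note on the status values documented in the comment above: they are
translated to errnos by __smu_cmn_reg2errno(). A simplified sketch of that
mapping (the exact errno choices here are illustrative; the driver's
function is the authority):

#include <linux/errno.h>
#include <linux/types.h>

static int smc_status_to_errno(u32 reg)
{
	switch (reg) {
	case 0x0:		/* SMU still busy with the command */
		return -ETIME;
	case 0x1:		/* executed and succeeded */
		return 0;
	case 0xFF:		/* executed but failed */
		return -EIO;
	case 0xFE:		/* unknown command */
		return -EOPNOTSUPP;
	default:
		return -EREMOTEIO;
	}
}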



Re: [PATCH v3 2/6] drm: improve drm_buddy_alloc function

2021-11-17 Thread Matthew Auld

On 16/11/2021 20:18, Arunpravin wrote:

- Make drm_buddy_alloc a single function to handle
   range allocation and non-range allocation demands

- Implemented a new function alloc_range() which allocates
   the requested power-of-two block complying with range limitations

- Moved order computation and memory alignment logic from
   i915 driver to drm buddy

v2:
   merged below changes to keep the build unbroken
- drm_buddy_alloc_range() becomes obsolete and may be removed
- enable ttm range allocation (fpfn / lpfn) support in i915 driver
- apply enhanced drm_buddy_alloc() function to i915 driver

v3(Matthew Auld):
   - Fix alignment issues and remove unnecessary list_empty check
   - add more validation checks for input arguments
   - make alloc_range() block allocations as bottom-up
   - optimize order computation logic
   - replace uint64_t with u64, which is preferred in the kernel

Signed-off-by: Arunpravin 
---
  drivers/gpu/drm/drm_buddy.c   | 259 ++
  drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  69 ++---
  drivers/gpu/drm/i915/i915_ttm_buddy_manager.h |   2 +
  include/drm/drm_buddy.h   |  22 +-
  4 files changed, 203 insertions(+), 149 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 39eb1d224bec..c9b18a29f8d1 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -274,63 +274,6 @@ void drm_buddy_free_list(struct drm_buddy_mm *mm, struct 
list_head *objects)
  }
  EXPORT_SYMBOL(drm_buddy_free_list);
  
-/**

- * drm_buddy_alloc - allocate power-of-two blocks
- *
- * @mm: DRM buddy manager to allocate from
- * @order: size of the allocation
- *
- * The order value here translates to:
- *
- * 0 = 2^0 * mm->chunk_size
- * 1 = 2^1 * mm->chunk_size
- * 2 = 2^2 * mm->chunk_size
- *
- * Returns:
- * allocated ptr to the &drm_buddy_block on success
- */
-struct drm_buddy_block *
-drm_buddy_alloc(struct drm_buddy_mm *mm, unsigned int order)
-{
-   struct drm_buddy_block *block = NULL;
-   unsigned int i;
-   int err;
-
-   for (i = order; i <= mm->max_order; ++i) {
-   block = list_first_entry_or_null(&mm->free_list[i],
-struct drm_buddy_block,
-link);
-   if (block)
-   break;
-   }
-
-   if (!block)
-   return ERR_PTR(-ENOSPC);
-
-   BUG_ON(!drm_buddy_block_is_free(block));
-
-   while (i != order) {
-   err = split_block(mm, block);
-   if (unlikely(err))
-   goto out_free;
-
-   /* Go low */
-   block = block->left;
-   i--;
-   }
-
-   mark_allocated(block);
-   mm->avail -= drm_buddy_block_size(mm, block);
-   kmemleak_update_trace(block);
-   return block;
-
-out_free:
-   if (i != order)
-   __drm_buddy_free(mm, block);
-   return ERR_PTR(err);
-}
-EXPORT_SYMBOL(drm_buddy_alloc);
-
  static inline bool overlaps(u64 s1, u64 e1, u64 s2, u64 e2)
  {
return s1 <= e2 && e1 >= s2;
@@ -341,52 +284,22 @@ static inline bool contains(u64 s1, u64 e1, u64 s2, u64 
e2)
return s1 <= s2 && e1 >= e2;
  }
  
-/**

- * drm_buddy_alloc_range - allocate range
- *
- * @mm: DRM buddy manager to allocate from
- * @blocks: output list head to add allocated blocks
- * @start: start of the allowed range for this block
- * @size: size of the allocation
- *
- * Intended for pre-allocating portions of the address space, for example to
- * reserve a block for the initial framebuffer or similar, hence the 
expectation
- * here is that drm_buddy_alloc() is still the main vehicle for
- * allocations, so if that's not the case then the drm_mm range allocator is
- * probably a much better fit, and so you should probably go use that instead.
- *
- * Note that it's safe to chain together multiple alloc_ranges
- * with the same blocks list
- *
- * Returns:
- * 0 on success, error code on failure.
- */
-int drm_buddy_alloc_range(struct drm_buddy_mm *mm,
- struct list_head *blocks,
- u64 start, u64 size)
+static struct drm_buddy_block *
+alloc_range(struct drm_buddy_mm *mm,
+   u64 start, u64 end,
+   unsigned int order)
  {
struct drm_buddy_block *block;
struct drm_buddy_block *buddy;
-   LIST_HEAD(allocated);
LIST_HEAD(dfs);
-   u64 end;
int err;
int i;
  
-	if (size < mm->chunk_size)

-   return -EINVAL;
-
-   if (!IS_ALIGNED(size | start, mm->chunk_size))
-   return -EINVAL;
-
-   if (range_overflows(start, size, mm->size))
-   return -EINVAL;
+   end = end - 1;
  
  	for (i = 0; i < mm->n_roots; ++i)

list_add_tail(&mm->roots[i]->tmp_link, &dfs);
  
-	end = start + size - 1;

-
do {
u64 block_s
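
A side note on the order semantics carried over from the removed kerneldoc:
an order is just a power-of-two multiple of the chunk size. Stated as code
(illustrative helpers, not part of the series):

#include <linux/log2.h>
#include <linux/types.h>

/* order 0 = chunk_size, order 1 = 2 * chunk_size, order n = 2^n * chunk_size */
static inline u64 buddy_order_to_bytes(u64 chunk_size, unsigned int order)
{
	return chunk_size << order;
}

/* inverse, assuming size is already a power-of-two multiple of chunk_size */
static inline unsigned int buddy_bytes_to_order(u64 chunk_size, u64 size)
{
	return ilog2(size) - ilog2(chunk_size);
}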

Re: [PATCH v3 1/6] drm: move the buddy allocator from i915 into common drm

2021-11-17 Thread Matthew Auld

On 16/11/2021 20:18, Arunpravin wrote:

Move the base i915 buddy allocator code into drm
- Move i915_buddy.h to include/drm
- Move i915_buddy.c to drm root folder
- Rename "i915" string with "drm" string wherever applicable
- Rename "I915" string with "DRM" string wherever applicable
- Fix header file dependencies
- Fix alignment issues
- add Makefile support for drm buddy
- export functions and write kerneldoc description
- Remove i915 selftest config check condition as buddy selftest
   will be moved to drm selftest folder

cleanup i915 buddy references in i915 driver module
and replace with drm buddy

v2:
   - include header file in alphabetical order (Thomas)
   - merged changes listed in the body section into a single patch
 to keep the build intact (Christian, Jani)

Signed-off-by: Arunpravin 


Any ideas for what to do with the existing selftests? Currently this 
series doesn't build yet for i915 because of that, which prevents throwing 
the series at CI.


Re: [PATCH v3 1/6] drm: move the buddy allocator from i915 into common drm

2021-11-17 Thread Arunpravin
Hi Christian,
I will make this a separate module.

Thanks,
Arun

On 17/11/21 1:33 pm, Christian König wrote:
> I've looked a bit more into this and I think we should just follow 
> Thomas Zimmermann's idea to make this a separate module.
> 
> Otherwise we just have the code around all the time even if it is unused 
> and implementing this should be trivial.
> 
> See how DRM_GEM_CMA_HELPER or DRM_VRAM_HELPER are done in 
> drivers/gpu/drm/Kconfig and how they then take effect in 
> drivers/gpu/drm/Makefile
> 
> Should be something like ~15 lines of additional configuration which 
> saves a bit of memory on embedded systems which just need a fbdev 
> replacement.
> 
> Thanks,
> Christian.
> 
> Am 16.11.21 um 21:18 schrieb Arunpravin:
>> Move the base i915 buddy allocator code into drm
>> - Move i915_buddy.h to include/drm
>> - Move i915_buddy.c to drm root folder
>> - Rename "i915" string with "drm" string wherever applicable
>> - Rename "I915" string with "DRM" string wherever applicable
>> - Fix header file dependencies
>> - Fix alignment issues
>> - add Makefile support for drm buddy
>> - export functions and write kerneldoc description
>> - Remove i915 selftest config check condition as buddy selftest
>>will be moved to drm selftest folder
>>
>> cleanup i915 buddy references in i915 driver module
>> and replace with drm buddy
>>
>> v2:
>>- include header file in alphabetical order (Thomas)
>>- merged changes listed in the body section into a single patch
>>  to keep the build intact (Christian, Jani)
>>
>> Signed-off-by: Arunpravin 
>> ---
>>   drivers/gpu/drm/Makefile  |   2 +-
>>   drivers/gpu/drm/drm_buddy.c   | 521 ++
>>   drivers/gpu/drm/drm_drv.c |   3 +
>>   drivers/gpu/drm/i915/Makefile |   1 -
>>   drivers/gpu/drm/i915/i915_buddy.c | 466 
>>   drivers/gpu/drm/i915/i915_buddy.h | 143 -
>>   drivers/gpu/drm/i915/i915_module.c|   3 -
>>   drivers/gpu/drm/i915/i915_scatterlist.c   |  11 +-
>>   drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  33 +-
>>   drivers/gpu/drm/i915/i915_ttm_buddy_manager.h |   4 +-
>>   include/drm/drm_buddy.h   | 153 +
>>   11 files changed, 702 insertions(+), 638 deletions(-)
>>   create mode 100644 drivers/gpu/drm/drm_buddy.c
>>   delete mode 100644 drivers/gpu/drm/i915/i915_buddy.c
>>   delete mode 100644 drivers/gpu/drm/i915/i915_buddy.h
>>   create mode 100644 include/drm/drm_buddy.h
>>
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index 0dff40bb863c..dc61e91a3154 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -18,7 +18,7 @@ drm-y   := drm_aperture.o drm_auth.o drm_cache.o \
>>  drm_dumb_buffers.o drm_mode_config.o drm_vblank.o \
>>  drm_syncobj.o drm_lease.o drm_writeback.o drm_client.o \
>>  drm_client_modeset.o drm_atomic_uapi.o drm_hdcp.o \
>> -drm_managed.o drm_vblank_work.o
>> +drm_managed.o drm_vblank_work.o drm_buddy.o
>>   
>>   drm-$(CONFIG_DRM_LEGACY) += drm_agpsupport.o drm_bufs.o drm_context.o 
>> drm_dma.o \
>>  drm_legacy_misc.o drm_lock.o drm_memory.o 
>> drm_scatter.o \
>> diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
>> new file mode 100644
>> index ..39eb1d224bec
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_buddy.c
>> @@ -0,0 +1,521 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2021 Intel Corporation
>> + */
>> +
>> +#include 
>> +#include 
>> +
>> +#include 
>> +
>> +static struct kmem_cache *slab_blocks;
>> +
>> +static struct drm_buddy_block *drm_block_alloc(struct drm_buddy_mm *mm,
>> +   struct drm_buddy_block *parent,
>> +   unsigned int order,
>> +   u64 offset)
>> +{
>> +struct drm_buddy_block *block;
>> +
>> +BUG_ON(order > DRM_BUDDY_MAX_ORDER);
>> +
>> +block = kmem_cache_zalloc(slab_blocks, GFP_KERNEL);
>> +if (!block)
>> +return NULL;
>> +
>> +block->header = offset;
>> +block->header |= order;
>> +block->parent = parent;
>> +
>> +BUG_ON(block->header & DRM_BUDDY_HEADER_UNUSED);
>> +return block;
>> +}
>> +
>> +static void drm_block_free(struct drm_buddy_mm *mm,
>> +   struct drm_buddy_block *block)
>> +{
>> +kmem_cache_free(slab_blocks, block);
>> +}
>> +
>> +static void mark_allocated(struct drm_buddy_block *block)
>> +{
>> +block->header &= ~DRM_BUDDY_HEADER_STATE;
>> +block->header |= DRM_BUDDY_ALLOCATED;
>> +
>> +list_del(&block->link);
>> +}
>> +
>> +static void mark_free(struct drm_buddy_mm *mm,
>> +  struct drm_buddy_block *block)
>> +{
>> +block->header &= ~DRM_BUDDY_HEADER_STATE;
>> +block->header |= DRM_BUDDY_FR

Re: Backlight control broken on UM325 (OLED) on 5.15 (bisected)

2021-11-17 Thread Samuel Čavoj
Hi Roman,

On 17.11.2021 15:26, Li, Roman wrote:
> [Public]
> 
> Hi Samuel,
> 
> Can you please try: https://patchwork.freedesktop.org/patch/463485/ ?

Yup, that did the trick. Works as before. Thank you very much.

Samuel

> 
> Thanks,
> Roman
> 
> > -Original Message-
> > From: Samuel Čavoj 
> > Sent: Tuesday, November 16, 2021 8:33 AM
> > To: Alex Deucher 
> > Cc: Deucher, Alexander ; Li, Sun peng (Leo)
> > ; Li, Roman ; Maling list - DRI
> > developers ; LKML  > ker...@vger.kernel.org>; amd-gfx list 
> > Subject: Re: Backlight control broken on UM325 (OLED) on 5.15 (bisected)
> >
> > Hi Alex,
> >
> > thank you for your response.
> >
> > On 15.11.2021 10:43, Alex Deucher wrote:
> > > [...]
> > >
> > > That patch adds support for systems with multiple backlights.  Do you
> > > have multiple backlight devices now?  If so, does the other one work?
> >
> > No, there is still only one backlight device -- amdgpu_bl0.
> > >
> > > Can you also try this patch?
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > > index 4811b0faafd9..67163c9d49e6 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > > @@ -854,8 +854,8 @@ int amdgpu_acpi_init(struct amdgpu_device *adev)
> > > if (amdgpu_device_has_dc_support(adev)) {
> > > #if defined(CONFIG_DRM_AMD_DC)
> > > struct amdgpu_display_manager *dm = &adev->dm;
> > > -   if (dm->backlight_dev[0])
> > > -   atif->bd = dm->backlight_dev[0];
> > > +   if (dm->backlight_dev[1])
> > > +   atif->bd = dm->backlight_dev[1];
> > >  #endif
> > > } else {
> > > struct drm_encoder *tmp;
> > >
> >
> > There is no difference in behaviour after applying the patch.
> >
> > Samuel
> >
> > >
> > > Alex
> > >
> > > >
> > > > Regards,
> > > > Samuel Čavoj
> > > >
> > > > [0]:
> > > > https://www.reddit.com/r/AMDLaptops/comments/qst0fm/after_updating_to_linux_515_my_brightness/


Re: [PATCH] drm/amd/display: Fix OLED brightness control on eDP

2021-11-17 Thread Samuel Čavoj
On 17.11.2021 10:22, roman...@amd.com wrote:
> From: Roman Li 
> 
> [Why]
> After commit ("drm/amd/display: add support for multiple backlights")
> the number of eDPs is defined while registering the backlight device.
> However, the panel's extended caps get updated once, before the register call.
> That leads to a regression with extended caps like OLED brightness control.
> 
> [How]
> Update connector ext caps after register_backlight_device
> 
> Fixes: b1c61212d8dc ("drm/amd/display: add support for multiple backlights")
> Link: 
> https://www.reddit.com/r/AMDLaptops/comments/qst0fm/after_updating_to_linux_515_my_brightness/

Tested-By: Samuel Čavoj 

> 
> Signed-off-by: Roman Li 
> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 26fcc89..44c9994 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -4243,7 +4243,8 @@ static int amdgpu_dm_initialize_drm_device(struct 
> amdgpu_device *adev)
>   } else if (dc_link_detect(link, DETECT_REASON_BOOT)) {
>   amdgpu_dm_update_connector_after_detect(aconnector);
>   register_backlight_device(dm, link);
> -
> + if (dm->num_of_edps)
> + update_connector_ext_caps(aconnector);
>   if (psr_feature_enabled)
>   amdgpu_dm_set_psr_caps(link);
>   }
> -- 
> 2.7.4
> 


Re: [PATCH] drm/amd/display: Fix OLED brightness control on eDP

2021-11-17 Thread Alex Deucher
On Wed, Nov 17, 2021 at 10:22 AM  wrote:
>
> From: Roman Li 
>
> [Why]
> After commit ("drm/amd/display: add support for multiple backlights")
> the number of eDPs is defined while registering the backlight device.
> However, the panel's extended caps get updated once, before the register call.
> That leads to a regression with extended caps like OLED brightness control.
>
> [How]
> Update connector ext caps after register_backlight_device
>
> Fixes: b1c61212d8dc ("drm/amd/display: add support for multiple backlights")
> Link: 
> https://www.reddit.com/r/AMDLaptops/comments/qst0fm/after_updating_to_linux_515_my_brightness/

Acked-by: Alex Deucher 

>
> Signed-off-by: Roman Li 
> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 26fcc89..44c9994 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -4243,7 +4243,8 @@ static int amdgpu_dm_initialize_drm_device(struct 
> amdgpu_device *adev)
> } else if (dc_link_detect(link, DETECT_REASON_BOOT)) {
> amdgpu_dm_update_connector_after_detect(aconnector);
> register_backlight_device(dm, link);
> -
> +   if (dm->num_of_edps)
> +   update_connector_ext_caps(aconnector);
> if (psr_feature_enabled)
> amdgpu_dm_set_psr_caps(link);
> }
> --
> 2.7.4
>


RE: Backlight control broken on UM325 (OLED) on 5.15 (bisected)

2021-11-17 Thread Li, Roman
[Public]

Hi Samuel,

Can you please try: https://patchwork.freedesktop.org/patch/463485/ ?

Thanks,
Roman

> -Original Message-
> From: Samuel Čavoj 
> Sent: Tuesday, November 16, 2021 8:33 AM
> To: Alex Deucher 
> Cc: Deucher, Alexander ; Li, Sun peng (Leo)
> ; Li, Roman ; Maling list - DRI
> developers ; LKML  ker...@vger.kernel.org>; amd-gfx list 
> Subject: Re: Backlight control broken on UM325 (OLED) on 5.15 (bisected)
>
> Hi Alex,
>
> thank you for your response.
>
> On 15.11.2021 10:43, Alex Deucher wrote:
> > [...]
> >
> > That patch adds support for systems with multiple backlights.  Do you
> > have multiple backlight devices now?  If so, does the other one work?
>
> No, there is still only one backlight device -- amdgpu_bl0.
> >
> > Can you also try this patch?
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > index 4811b0faafd9..67163c9d49e6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > @@ -854,8 +854,8 @@ int amdgpu_acpi_init(struct amdgpu_device *adev)
> > if (amdgpu_device_has_dc_support(adev)) {
> > #if defined(CONFIG_DRM_AMD_DC)
> > struct amdgpu_display_manager *dm = &adev->dm;
> > -   if (dm->backlight_dev[0])
> > -   atif->bd = dm->backlight_dev[0];
> > +   if (dm->backlight_dev[1])
> > +   atif->bd = dm->backlight_dev[1];
> >  #endif
> > } else {
> > struct drm_encoder *tmp;
> >
>
> There is no difference in behaviour after applying the patch.
>
> Samuel
>
> >
> > Alex
> >
> > >
> > > Regards,
> > > Samuel Čavoj
> > >
> > > [0]:
> > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fww
> > >
> w.reddit.com%2Fr%2FAMDLaptops%2Fcomments%2Fqst0fm%2Fafter_updating
> _t
> > >
> o_linux_515_my_brightness%2F&data=04%7C01%7CRoman.Li%40amd.co
> m%7
> > >
> Ce1c766a2f7014cdb664308d9a9059cc6%7C3dd8961fe4884e608e11a82d994e1
> 83d
> > >
> %7C0%7C0%7C637726663861883494%7CUnknown%7CTWFpbGZsb3d8eyJWIjoi
> MC4wLj
> > >
> AwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&s
> dat
> > >
> a=hfsaEzng9%2FjAI2F%2BKg87Tv2Mu%2FfPurCQELr62%2B%2FVF%2BQ%3D&a
> mp;res
> > > erved=0


[PATCH] drm/amd/display: Fix OLED brightness control on eDP

2021-11-17 Thread Roman.Li
From: Roman Li 

[Why]
After commit ("drm/amd/display: add support for multiple backlights")
the number of eDPs is defined while registering the backlight device.
However, the panel's extended caps get updated once, before the register call.
That leads to a regression with extended caps like OLED brightness control.

[How]
Update connector ext caps after register_backlight_device

Fixes: b1c61212d8dc ("drm/amd/display: add support for multiple backlights")
Link: 
https://www.reddit.com/r/AMDLaptops/comments/qst0fm/after_updating_to_linux_515_my_brightness/

Signed-off-by: Roman Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 26fcc89..44c9994 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -4243,7 +4243,8 @@ static int amdgpu_dm_initialize_drm_device(struct 
amdgpu_device *adev)
} else if (dc_link_detect(link, DETECT_REASON_BOOT)) {
amdgpu_dm_update_connector_after_detect(aconnector);
register_backlight_device(dm, link);
-
+   if (dm->num_of_edps)
+   update_connector_ext_caps(aconnector);
if (psr_feature_enabled)
amdgpu_dm_set_psr_caps(link);
}
-- 
2.7.4



Re: [PATCH v1 1/9] mm: add zone device coherent type memory support

2021-11-17 Thread Christoph Hellwig
On Mon, Nov 15, 2021 at 01:30:18PM -0600, Alex Sierra wrote:
> @@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   * target for charge migration. if @target is not NULL, the entry is 
> stored
>   * in target->ent.
> - *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
> MEMORY_DEVICE_PRIVATE
> - * (so ZONE_DEVICE page and thus not on the lru).
> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
> MEMORY_DEVICE_COHERENT
> + * or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the 
> lru).

Please avoid the overly long line.  But I don't think we need to mention
the exact enum, but rather do something like:

 *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
 * thus not on the lru.

> + switch (pgmap->type) {
> + case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_COHERENT:
>   /*
>* TODO: Handle HMM pages which may need coordination
>* with device-side memory.

This might be a good opportunity for doing a s/HMM/device/ here.
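
Put together, the hunk with both suggestions applied might read
(illustrative sketch, not a resend):

	switch (pgmap->type) {
	case MEMORY_DEVICE_PRIVATE:
	case MEMORY_DEVICE_COHERENT:
		/*
		 * TODO: Handle device pages which may need coordination
		 * with device-side memory.
		 */
		break;
	default:
		break;
	}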


Re: [PATCH Review 2/4] drm/amdgpu: add new query interface for umc block

2021-11-17 Thread Lazar, Lijo




On 11/17/2021 3:41 PM, Stanley.Yang wrote:

Add a message to the SMU to query error information.

Signed-off-by: Stanley.Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  16 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |   4 +
  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   | 161 
  3 files changed, 181 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index cdd0010a5389..bcbf3264d92f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -320,6 +320,19 @@ struct ras_common_if {
char name[32];
  };
  
+#define MAX_UMC_CHANNEL_NUM 32

+
+struct ecc_info_per_ch {
+   uint16_t ce_count_lo_chip;
+   uint16_t ce_count_hi_chip;
+   uint64_t mca_umc_status;
+   uint64_t mca_umc_addr;
+};
+
+struct umc_ecc_info {
+   struct ecc_info_per_ch ecc[MAX_UMC_CHANNEL_NUM];
+};
+
  struct amdgpu_ras {
/* ras infrastructure */
/* for ras itself. */
@@ -359,6 +372,9 @@ struct amdgpu_ras {
struct delayed_work ras_counte_delay_work;
atomic_t ras_ue_count;
atomic_t ras_ce_count;
+
+   /* record umc error info queried from smu */
+   struct umc_ecc_info umc_ecc;
  };
  
  struct ras_fs_data {

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 1f5fe2315236..7aa9b21eb906 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -49,6 +49,10 @@ struct amdgpu_umc_ras_funcs {
void (*query_ras_error_address)(struct amdgpu_device *adev,
void *ras_error_status);
bool (*query_ras_poison_mode)(struct amdgpu_device *adev);
+   void (*message_smu_query_ras_error_count)(struct amdgpu_device *adev,
+ void *ras_error_status);
+   void (*message_smu_query_ras_error_address)(struct amdgpu_device *adev,
+   void *ras_error_status);


Maybe rename message_smu to ecc_info. These methods fetch the error from 
umc_ecc_info table. They don't deal with smu or care about how the 
information gets filled. As long as ecc_info_table is filled, they could 
get the info.


Thanks,
Lijo
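
A sketch of what that rename might look like in amdgpu_umc.h (hypothetical
names; the point is that the callbacks only read the already-filled
umc_ecc_info table):

struct amdgpu_device;

struct amdgpu_umc_ras_ecc_info_funcs {
	void (*ecc_info_query_ras_error_count)(struct amdgpu_device *adev,
					       void *ras_error_status);
	void (*ecc_info_query_ras_error_address)(struct amdgpu_device *adev,
						 void *ras_error_status);
};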


  };
  
  struct amdgpu_umc_funcs {

diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c 
b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
index f7ec3fe134e5..cd96e8b734cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
+++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
@@ -50,6 +50,165 @@ static inline uint32_t get_umc_v6_7_reg_offset(struct 
amdgpu_device *adev,
return adev->umc.channel_offs * ch_inst + UMC_V6_7_INST_DIST * umc_inst;
  }
  
+static inline uint32_t get_umc_v6_7_channel_index(struct amdgpu_device *adev,

+ uint32_t umc_inst,
+ uint32_t ch_inst)
+{
+	return adev->umc.channel_idx_tbl[umc_inst * adev->umc.channel_inst_num + ch_inst];
+}
+
+static void umc_v6_7_message_smu_query_correctable_error_count(struct amdgpu_device *adev,
+  uint32_t channel_index,
+  unsigned long *error_count)
+{
+   uint32_t ecc_err_cnt;
+   uint64_t mc_umc_status;
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   /*
+* select the lower chip and check the error count
+* skip add error count, calc error counter only from mca_umc_status
+*/
+   ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_lo_chip;
+
+   /*
+* select the higher chip and check the err counter
+* skip add error count, calc error counter only from mca_umc_status
+*/
+   ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_hi_chip;
+
+   /* check for SRAM correctable error
+ MCUMC_STATUS is a 64 bit register */
+   mc_umc_status = ras->umc_ecc.ecc[channel_index].mca_umc_status;
+	if (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 &&
+	    REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 1)
+   *error_count += 1;
+}
+
+static void umc_v6_7_message_smu_querry_uncorrectable_error_count(struct amdgpu_device *adev,
+						      uint32_t channel_index,
+						      unsigned long *error_count)
+{
+   uint64_t mc_umc_status;
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   /* check the MCUMC_STATUS */
+   mc_umc_status = ras->umc_ecc.ecc[channel_index].mca_umc_status;
+	if ((REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1) &&
+	    (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Deferred) == 1 ||
+	    REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, UECC) == 1 ||
+ 

Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Lazar, Lijo




On 11/17/2021 3:41 PM, Stanley.Yang wrote:

Support the ECC TABLE message; this table includes UMC RAS error counts
and error addresses.

Signed-off-by: Stanley.Yang 
---
  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 
  .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38 +++
  .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 24 
  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
  5 files changed, 74 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 3557f4e7fc30..ea65de0160c3 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -324,6 +324,7 @@ enum smu_table_id
SMU_TABLE_OVERDRIVE,
SMU_TABLE_I2C_COMMANDS,
SMU_TABLE_PACE,
+   SMU_TABLE_ECCINFO,
SMU_TABLE_COUNT,
  };
  
@@ -340,6 +341,7 @@ struct smu_table_context

void*max_sustainable_clocks;
struct smu_bios_boot_up_values  boot_values;
void*driver_pptable;
+   void*ecc_table;
struct smu_tabletables[SMU_TABLE_COUNT];
/*
 * The driver table is just a staging buffer for
@@ -1261,6 +1263,11 @@ struct pptable_funcs {
 *  
of SMUBUS table.
 */
int (*send_hbm_bad_pages_num)(struct smu_context *smu, uint32_t size);
+
+   /**
+* @get_ecc_table:  message SMU to get ECC INFO table.
+*/
+   ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
  };
  
  typedef enum {

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index f835d86cc2f5..5e4ba0e14a91 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -190,6 +190,7 @@ static const struct cmn2asic_mapping 
aldebaran_table_map[SMU_TABLE_COUNT] = {
TAB_MAP(SMU_METRICS),
TAB_MAP(DRIVER_SMU_CONFIG),
TAB_MAP(I2C_COMMANDS),
+   TAB_MAP(ECCINFO),
  };
  
  static const uint8_t aldebaran_throttler_map[] = {

@@ -223,6 +224,9 @@ static int aldebaran_tables_init(struct smu_context *smu)
SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS, sizeof(SwI2cRequest_t),
   PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
  
+	SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO, sizeof(EccInfoTable_t),

+  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
+
smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t), GFP_KERNEL);
if (!smu_table->metrics_table)
return -ENOMEM;
@@ -235,6 +239,10 @@ static int aldebaran_tables_init(struct smu_context *smu)
return -ENOMEM;
}
  
+	smu_table->ecc_table = kzalloc(tables[SMU_TABLE_ECCINFO].size, GFP_KERNEL);

+   if (!smu_table->ecc_table)
+   return -ENOMEM;
+
return 0;
  }
  
@@ -1765,6 +1773,35 @@ static ssize_t aldebaran_get_gpu_metrics(struct smu_context *smu,

return sizeof(struct gpu_metrics_v1_3);
  }
  
+static ssize_t aldebaran_get_ecc_info(struct smu_context *smu,

+void *table)
+{
+   struct smu_table_context *smu_table = &smu->smu_table;
+   EccInfoTable_t ecc_table;
+   struct ecc_info_per_ch *ecc_info_per_channel = NULL;
+   int i, ret = 0;
+   struct umc_ecc_info *eccinfo = (struct umc_ecc_info *)table;
+
+   ret = smu_cmn_get_ecc_info_table(smu,
+   &ecc_table);
+   if (ret)
+   return ret;
+
+   for (i = 0; i < ALDEBARAN_UMC_CHANNEL_NUM; i++) {
+   ecc_info_per_channel = &(eccinfo->ecc[i]);
+   ecc_info_per_channel->ce_count_lo_chip =
+   ecc_table.EccInfo[i].ce_count_lo_chip;
+   ecc_info_per_channel->ce_count_hi_chip =
+   ecc_table.EccInfo[i].ce_count_hi_chip;
+   ecc_info_per_channel->mca_umc_status =
+   ecc_table.EccInfo[i].mca_umc_status;
+   ecc_info_per_channel->mca_umc_addr =
+   ecc_table.EccInfo[i].mca_umc_addr;
+   }
+
+   return ret;
+}
+
  static int aldebaran_mode1_reset(struct smu_context *smu)
  {
u32 smu_version, fatal_err, param;
@@ -1967,6 +2004,7 @@ static const struct pptable_funcs aldebaran_ppt_funcs = {
.i2c_init = aldebaran_i2c_control_init,
.i2c_fini = aldebaran_i2c_control_fini,
.send_hbm_bad_pages_num = aldebaran_smu_send_hbm_bad_page_num,
+   .get_ecc_info = aldebaran_get_ecc_info,
  };
  
  void aldebaran_set_ppt_funcs(struct smu_context *smu)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 4d96099a9bb1..55421e

Re: [PATCH Review 4/4] query umc error info from ecc_table

2021-11-17 Thread Lazar, Lijo




On 11/17/2021 3:41 PM, Stanley.Yang wrote:

If the SMU supports ECCTABLE, the driver can message the SMU to get the
ecc_table and then query UMC error info from ECCTABLE. Apply a PMFW
version check to ensure backward compatibility.

Signed-off-by: Stanley.Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 42 ---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  7 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   | 71 +--
  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  1 +
  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 12 
  .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  4 ++
  6 files changed, 107 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 90f0db3b4f65..6b0f2ba1e420 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -888,6 +888,38 @@ void amdgpu_ras_mca_query_error_status(struct 
amdgpu_device *adev,
}
  }
  
+static void amdgpu_ras_get_ecc_info(struct amdgpu_device *adev, struct ras_err_data *err_data)

+{
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   /*
+* choosing right query method according to
+* whether smu support query error information
+*/
+   if ((ras->smu_version >= SUPPORT_ECCTABLE_SMU_VERSION) &&
+	    !smu_get_ecc_info(&adev->smu, (void *)&(ras->umc_ecc))) {
+


This version check should be in aldebaran_ppt implementation. In general 
the callback will check the FW version that supports ECC table for the 
corresponding ASIC. It may return ENOTSUPP or similar if the FW version 
doesn't support ECC table and that may be checked here. Keeping 
smu_version in ras context is not needed.
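
A minimal sketch of pushing the check into the ASIC callback (helper names
as elsewhere in this series; the errno and control flow are illustrative,
not the final patch):

static ssize_t aldebaran_get_ecc_info(struct smu_context *smu, void *table)
{
	uint32_t smu_version;
	int ret;

	ret = smu_cmn_get_smc_version(smu, NULL, &smu_version);
	if (ret)
		return ret;

	/* PMFW older than 68.42.0 has no ECCTABLE */
	if (smu_version < SUPPORT_ECCTABLE_SMU_VERSION)
		return -EOPNOTSUPP;

	/* ... fetch and translate the ECC table as in patch 3/4 ... */
	return 0;
}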



+		if (adev->umc.ras_funcs &&
+		    adev->umc.ras_funcs->message_smu_query_ras_error_count)
+			adev->umc.ras_funcs->message_smu_query_ras_error_count(adev, err_data);
+
+		if (adev->umc.ras_funcs &&
+		    adev->umc.ras_funcs->message_smu_query_ras_error_address)
+			adev->umc.ras_funcs->message_smu_query_ras_error_address(adev, err_data);
+	} else {
+		if (adev->umc.ras_funcs &&
+		    adev->umc.ras_funcs->query_ras_error_count)
+			adev->umc.ras_funcs->query_ras_error_count(adev, err_data);
+
+		/* umc query_ras_error_address is also responsible for clearing
+		 * error status
+		 */
+		if (adev->umc.ras_funcs &&
+		    adev->umc.ras_funcs->query_ras_error_address)
+			adev->umc.ras_funcs->query_ras_error_address(adev, err_data);
+	}
+}
+
  /* query/inject/cure begin */
  int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
  struct ras_query_if *info)
@@ -901,15 +933,7 @@ int amdgpu_ras_query_error_status(struct amdgpu_device 
*adev,
  
  	switch (info->head.block) {

case AMDGPU_RAS_BLOCK__UMC:
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_error_count)
-		adev->umc.ras_funcs->query_ras_error_count(adev, &err_data);
-	/* umc query_ras_error_address is also responsible for clearing
-	 * error status
-	 */
-	if (adev->umc.ras_funcs &&
-	    adev->umc.ras_funcs->query_ras_error_address)
-		adev->umc.ras_funcs->query_ras_error_address(adev, &err_data);
+   amdgpu_ras_get_ecc_info(adev, &err_data);
break;
case AMDGPU_RAS_BLOCK__SDMA:
if (adev->sdma.funcs->query_ras_error_count) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index bcbf3264d92f..3f0de0cc8403 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -322,6 +322,12 @@ struct ras_common_if {
  
  #define MAX_UMC_CHANNEL_NUM 32
  
+/*

+ * SMU support ECCTABLE since version 68.42.0,
+ * use this to decide query umc error info method
+ */
+#define SUPPORT_ECCTABLE_SMU_VERSION 0x00442a00
+
  struct ecc_info_per_ch {
uint16_t ce_count_lo_chip;
uint16_t ce_count_hi_chip;
@@ -375,6 +381,7 @@ struct amdgpu_ras {
  
  	/* record umc error info queried from smu */

struct umc_ecc_info umc_ecc;
+   uint32_t smu_version;
  };
  
  struct ras_fs_data {

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 0c7c56a91b25..2c3e97c9410b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -97,28 +97,57 @@ int amdgpu_umc_process_ras_data_cb(struct amdgpu_device 
*adev,
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
  
  	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);

-   if (adev->umc.

[PATCH Review 4/4] query umc error info from ecc_table

2021-11-17 Thread Stanley . Yang
If the SMU supports ECCTABLE, the driver can message the SMU to get the
ecc_table and then query UMC error info from ECCTABLE. Apply a PMFW
version check to ensure backward compatibility.

Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 42 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  7 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   | 71 +--
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  1 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 12 
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  4 ++
 6 files changed, 107 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 90f0db3b4f65..6b0f2ba1e420 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -888,6 +888,38 @@ void amdgpu_ras_mca_query_error_status(struct 
amdgpu_device *adev,
}
 }
 
+static void amdgpu_ras_get_ecc_info(struct amdgpu_device *adev, struct 
ras_err_data *err_data)
+{
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   /*
+* choosing right query method according to
+* whether smu support query error information
+*/
+	if ((ras->smu_version >= SUPPORT_ECCTABLE_SMU_VERSION) &&
+	    !smu_get_ecc_info(&adev->smu, (void *)&(ras->umc_ecc))) {
+
+		if (adev->umc.ras_funcs &&
+		    adev->umc.ras_funcs->message_smu_query_ras_error_count)
+			adev->umc.ras_funcs->message_smu_query_ras_error_count(adev, err_data);
+
+		if (adev->umc.ras_funcs &&
+		    adev->umc.ras_funcs->message_smu_query_ras_error_address)
+			adev->umc.ras_funcs->message_smu_query_ras_error_address(adev, err_data);
+	} else {
+		if (adev->umc.ras_funcs &&
+		    adev->umc.ras_funcs->query_ras_error_count)
+			adev->umc.ras_funcs->query_ras_error_count(adev, err_data);
+
+		/* umc query_ras_error_address is also responsible for clearing
+		 * error status
+		 */
+		if (adev->umc.ras_funcs &&
+		    adev->umc.ras_funcs->query_ras_error_address)
+			adev->umc.ras_funcs->query_ras_error_address(adev, err_data);
+	}
+}
+
 /* query/inject/cure begin */
 int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
  struct ras_query_if *info)
@@ -901,15 +933,7 @@ int amdgpu_ras_query_error_status(struct amdgpu_device 
*adev,
 
switch (info->head.block) {
case AMDGPU_RAS_BLOCK__UMC:
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_error_count)
-		adev->umc.ras_funcs->query_ras_error_count(adev, &err_data);
-	/* umc query_ras_error_address is also responsible for clearing
-	 * error status
-	 */
-	if (adev->umc.ras_funcs &&
-	    adev->umc.ras_funcs->query_ras_error_address)
-		adev->umc.ras_funcs->query_ras_error_address(adev, &err_data);
+   amdgpu_ras_get_ecc_info(adev, &err_data);
break;
case AMDGPU_RAS_BLOCK__SDMA:
if (adev->sdma.funcs->query_ras_error_count) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index bcbf3264d92f..3f0de0cc8403 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -322,6 +322,12 @@ struct ras_common_if {
 
 #define MAX_UMC_CHANNEL_NUM 32
 
+/*
+ * SMU support ECCTABLE since version 68.42.0,
+ * use this to decide query umc error info method
+ */
+#define SUPPORT_ECCTABLE_SMU_VERSION 0x00442a00
+
 struct ecc_info_per_ch {
uint16_t ce_count_lo_chip;
uint16_t ce_count_hi_chip;
@@ -375,6 +381,7 @@ struct amdgpu_ras {
 
/* record umc error info queried from smu */
struct umc_ecc_info umc_ecc;
+   uint32_t smu_version;
 };
 
 struct ras_fs_data {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 0c7c56a91b25..2c3e97c9410b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -97,28 +97,57 @@ int amdgpu_umc_process_ras_data_cb(struct amdgpu_device 
*adev,
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 
kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_error_count)
-   adev->umc.ras_funcs->query_ras_error_count(adev, ras_error_status);
 
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_error_address &&
-   adev->umc.max_ras_err_cnt_per_query) {
-   err_data->err_addr =
-   kcalloc(adev->umc.max_ras_err_cn

[PATCH Review 2/4] drm/amdgpu: add new query interface for umc block

2021-11-17 Thread Stanley . Yang
Add a message to the SMU to query error information.

Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  16 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |   4 +
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   | 161 
 3 files changed, 181 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index cdd0010a5389..bcbf3264d92f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -320,6 +320,19 @@ struct ras_common_if {
char name[32];
 };
 
+#define MAX_UMC_CHANNEL_NUM 32
+
+struct ecc_info_per_ch {
+   uint16_t ce_count_lo_chip;
+   uint16_t ce_count_hi_chip;
+   uint64_t mca_umc_status;
+   uint64_t mca_umc_addr;
+};
+
+struct umc_ecc_info {
+   struct ecc_info_per_ch ecc[MAX_UMC_CHANNEL_NUM];
+};
+
 struct amdgpu_ras {
/* ras infrastructure */
/* for ras itself. */
@@ -359,6 +372,9 @@ struct amdgpu_ras {
struct delayed_work ras_counte_delay_work;
atomic_t ras_ue_count;
atomic_t ras_ce_count;
+
+   /* record umc error info queried from smu */
+   struct umc_ecc_info umc_ecc;
 };
 
 struct ras_fs_data {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 1f5fe2315236..7aa9b21eb906 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -49,6 +49,10 @@ struct amdgpu_umc_ras_funcs {
void (*query_ras_error_address)(struct amdgpu_device *adev,
void *ras_error_status);
bool (*query_ras_poison_mode)(struct amdgpu_device *adev);
+   void (*message_smu_query_ras_error_count)(struct amdgpu_device *adev,
+ void *ras_error_status);
+   void (*message_smu_query_ras_error_address)(struct amdgpu_device *adev,
+   void *ras_error_status);
 };
 
 struct amdgpu_umc_funcs {
diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c 
b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
index f7ec3fe134e5..cd96e8b734cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
+++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
@@ -50,6 +50,165 @@ static inline uint32_t get_umc_v6_7_reg_offset(struct 
amdgpu_device *adev,
return adev->umc.channel_offs * ch_inst + UMC_V6_7_INST_DIST * umc_inst;
 }
 
+static inline uint32_t get_umc_v6_7_channel_index(struct amdgpu_device *adev,
+ uint32_t umc_inst,
+ uint32_t ch_inst)
+{
+	return adev->umc.channel_idx_tbl[umc_inst * adev->umc.channel_inst_num + ch_inst];
+}
+
+static void umc_v6_7_message_smu_query_correctable_error_count(struct amdgpu_device *adev,
+  uint32_t channel_index,
+  unsigned long *error_count)
+{
+   uint32_t ecc_err_cnt;
+   uint64_t mc_umc_status;
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   /*
+* select the lower chip and check the error count
+* skip add error count, calc error counter only from mca_umc_status
+*/
+   ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_lo_chip;
+
+   /*
+* select the higher chip and check the err counter
+* skip add error count, calc error counter only from mca_umc_status
+*/
+   ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_hi_chip;
+
+   /* check for SRAM correctable error
+ MCUMC_STATUS is a 64 bit register */
+   mc_umc_status = ras->umc_ecc.ecc[channel_index].mca_umc_status;
+	if (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 &&
+	    REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 1)
+   *error_count += 1;
+}
+
+static void umc_v6_7_message_smu_querry_uncorrectable_error_count(struct amdgpu_device *adev,
+						      uint32_t channel_index,
+						      unsigned long *error_count)
+{
+	uint64_t mc_umc_status;
+	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+	/* check the MCUMC_STATUS */
+	mc_umc_status = ras->umc_ecc.ecc[channel_index].mca_umc_status;
+	if ((REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1) &&
+	    (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Deferred) == 1 ||
+	    REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, UECC) == 1 ||
+	    REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, PCC) == 1 ||
+	    REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, UC) == 1 ||
+	    REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, TCC) == 1))
+		*error_count += 1;
+}
+
+static void umc_v6_7_message_sm
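
For readers unfamiliar with the macro used throughout these hunks:
REG_GET_FIELD() extracts a named bitfield via generated mask/shift
constants, roughly (simplified; see the real macro in
drivers/gpu/drm/amd/amdgpu/amdgpu.h):

#define REG_GET_FIELD(value, reg, field) \
	(((value) & reg##__##field##_MASK) >> reg##__##field##__SHIFT)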

[PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Stanley . Yang
Support the ECC TABLE message; this table includes UMC RAS error counts
and error addresses.

Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38 +++
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 24 
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
 5 files changed, 74 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 3557f4e7fc30..ea65de0160c3 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -324,6 +324,7 @@ enum smu_table_id
SMU_TABLE_OVERDRIVE,
SMU_TABLE_I2C_COMMANDS,
SMU_TABLE_PACE,
+   SMU_TABLE_ECCINFO,
SMU_TABLE_COUNT,
 };
 
@@ -340,6 +341,7 @@ struct smu_table_context
void*max_sustainable_clocks;
struct smu_bios_boot_up_values  boot_values;
void*driver_pptable;
+   void*ecc_table;
struct smu_tabletables[SMU_TABLE_COUNT];
/*
 * The driver table is just a staging buffer for
@@ -1261,6 +1263,11 @@ struct pptable_funcs {
 *  
of SMUBUS table.
 */
int (*send_hbm_bad_pages_num)(struct smu_context *smu, uint32_t size);
+
+   /**
+* @get_ecc_table:  message SMU to get ECC INFO table.
+*/
+   ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
 };
 
 typedef enum {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index f835d86cc2f5..5e4ba0e14a91 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -190,6 +190,7 @@ static const struct cmn2asic_mapping 
aldebaran_table_map[SMU_TABLE_COUNT] = {
TAB_MAP(SMU_METRICS),
TAB_MAP(DRIVER_SMU_CONFIG),
TAB_MAP(I2C_COMMANDS),
+   TAB_MAP(ECCINFO),
 };
 
 static const uint8_t aldebaran_throttler_map[] = {
@@ -223,6 +224,9 @@ static int aldebaran_tables_init(struct smu_context *smu)
SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS, sizeof(SwI2cRequest_t),
   PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
 
+   SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO, sizeof(EccInfoTable_t),
+  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
+
smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t), GFP_KERNEL);
if (!smu_table->metrics_table)
return -ENOMEM;
@@ -235,6 +239,10 @@ static int aldebaran_tables_init(struct smu_context *smu)
return -ENOMEM;
}
 
+	smu_table->ecc_table = kzalloc(tables[SMU_TABLE_ECCINFO].size, GFP_KERNEL);
+   if (!smu_table->ecc_table)
+   return -ENOMEM;
+
return 0;
 }
 
@@ -1765,6 +1773,35 @@ static ssize_t aldebaran_get_gpu_metrics(struct 
smu_context *smu,
return sizeof(struct gpu_metrics_v1_3);
 }
 
+static ssize_t aldebaran_get_ecc_info(struct smu_context *smu,
+void *table)
+{
+   struct smu_table_context *smu_table = &smu->smu_table;
+   EccInfoTable_t ecc_table;
+   struct ecc_info_per_ch *ecc_info_per_channel = NULL;
+   int i, ret = 0;
+   struct umc_ecc_info *eccinfo = (struct umc_ecc_info *)table;
+
+   ret = smu_cmn_get_ecc_info_table(smu,
+   &ecc_table);
+   if (ret)
+   return ret;
+
+   for (i = 0; i < ALDEBARAN_UMC_CHANNEL_NUM; i++) {
+   ecc_info_per_channel = &(eccinfo->ecc[i]);
+   ecc_info_per_channel->ce_count_lo_chip =
+   ecc_table.EccInfo[i].ce_count_lo_chip;
+   ecc_info_per_channel->ce_count_hi_chip =
+   ecc_table.EccInfo[i].ce_count_hi_chip;
+   ecc_info_per_channel->mca_umc_status =
+   ecc_table.EccInfo[i].mca_umc_status;
+   ecc_info_per_channel->mca_umc_addr =
+   ecc_table.EccInfo[i].mca_umc_addr;
+   }
+
+   return ret;
+}
+
 static int aldebaran_mode1_reset(struct smu_context *smu)
 {
u32 smu_version, fatal_err, param;
@@ -1967,6 +2004,7 @@ static const struct pptable_funcs aldebaran_ppt_funcs = {
.i2c_init = aldebaran_i2c_control_init,
.i2c_fini = aldebaran_i2c_control_fini,
.send_hbm_bad_pages_num = aldebaran_smu_send_hbm_bad_page_num,
+   .get_ecc_info = aldebaran_get_ecc_info,
 };
 
 void aldebaran_set_ppt_funcs(struct smu_context *smu)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 4d96099a9bb1..55421ea622fb 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c

[PATCH Review 1/4] drm/amdgpu: Update smu driver interface for aldebaran

2021-11-17 Thread Stanley . Yang
Update the SMU driver interface version to 0x08 to avoid a version-mismatch
warning in the log. A version mismatch can still happen with an older FW.

Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang 
---
 .../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-
 drivers/gpu/drm/amd/pm/inc/smu_v13_0.h |  2 +-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/smu13_driver_if_aldebaran.h 
b/drivers/gpu/drm/amd/pm/inc/smu13_driver_if_aldebaran.h
index a017983ff1fa..0f67c56c2863 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu13_driver_if_aldebaran.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu13_driver_if_aldebaran.h
@@ -140,6 +140,8 @@
 
 #define MAX_SW_I2C_COMMANDS24
 
+#define ALDEBARAN_UMC_CHANNEL_NUM32
+
 typedef enum {
   I2C_CONTROLLER_PORT_0, //CKSVII2C0
   I2C_CONTROLLER_PORT_1, //CKSVII2C1
@@ -507,6 +509,19 @@ typedef struct {
   uint32_t MmHubPadding[8]; // SMU internal use
 } AvfsDebugTable_t;
 
+typedef struct {
+   uint64_t mca_umc_status;
+   uint64_t mca_umc_addr;
+   uint16_t ce_count_lo_chip;
+   uint16_t ce_count_hi_chip;
+
+   uint32_t eccPadding;
+} EccInfo_t;
+
+typedef struct {
+   EccInfo_t  EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
+} EccInfoTable_t;
+
 // These defines are used with the following messages:
 // SMC_MSG_TransferTableDram2Smu
 // SMC_MSG_TransferTableSmu2Dram
@@ -517,6 +532,7 @@ typedef struct {
 #define TABLE_SMU_METRICS 4
 #define TABLE_DRIVER_SMU_CONFIG   5
 #define TABLE_I2C_COMMANDS6
-#define TABLE_COUNT   7
+#define TABLE_ECCINFO 7
+#define TABLE_COUNT   8
 
 #endif
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h 
b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
index bbc608c990b0..44af23ae059e 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
@@ -27,7 +27,7 @@
 
 #define SMU13_DRIVER_IF_VERSION_INV 0x
 #define SMU13_DRIVER_IF_VERSION_YELLOW_CARP 0x04
-#define SMU13_DRIVER_IF_VERSION_ALDE 0x07
+#define SMU13_DRIVER_IF_VERSION_ALDE 0x08
 
 #define SMU13_MODE1_RESET_WAIT_TIME_IN_MS 500  //500ms
 
-- 
2.17.1
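
A quick size check on the new shared structs (illustrative; assumes the
natural alignment the layout already provides): each EccInfo_t is
8 + 8 + 2 + 2 + 4 = 24 bytes, so the whole table is 32 * 24 = 768 bytes.

static_assert(sizeof(EccInfo_t) == 24, "EccInfo_t is shared with PMFW");
static_assert(sizeof(EccInfoTable_t) ==
	      ALDEBARAN_UMC_CHANNEL_NUM * sizeof(EccInfo_t),
	      "EccInfoTable_t is shared with PMFW");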



Re: [PATCH v3 1/6] drm: move the buddy allocator from i915 into common drm

2021-11-17 Thread Christian König
I've looked a bit more into this and I think we should just follow 
Thomas Zimmermann's idea to make this a separate module.


Otherwise we just have the code around all the time even if it is unused 
and implementing this should be trivial.


See how DRM_GEM_CMA_HELPER or DRM_VRAM_HELPER are done in 
drivers/gpu/drm/Kconfig and how they then take effect in 
drivers/gpu/drm/Makefile


Should be something like ~15 lines of additional configuration which 
saves a bit of memory on embedded systems which just need a fbdev 
replacement.
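
For concreteness, a sketch of that wiring, modelled on the existing helper
options (the config symbol is hypothetical, not from the series):

# drivers/gpu/drm/Kconfig (sketch)
#   config DRM_BUDDY
#           tristate
#           depends on DRM
#           help
#             A page based buddy allocator, selected by drivers that need it.

# drivers/gpu/drm/Makefile (sketch)
obj-$(CONFIG_DRM_BUDDY) += drm_buddy.o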


Thanks,
Christian.

Am 16.11.21 um 21:18 schrieb Arunpravin:

Move the base i915 buddy allocator code into drm
- Move i915_buddy.h to include/drm
- Move i915_buddy.c to drm root folder
- Rename "i915" string with "drm" string wherever applicable
- Rename "I915" string with "DRM" string wherever applicable
- Fix header file dependencies
- Fix alignment issues
- add Makefile support for drm buddy
- export functions and write kerneldoc description
- Remove i915 selftest config check condition as buddy selftest
   will be moved to drm selftest folder

cleanup i915 buddy references in i915 driver module
and replace with drm buddy

v2:
   - include header file in alphabetical order (Thomas)
   - merged changes listed in the body section into a single patch
 to keep the build intact (Christian, Jani)

Signed-off-by: Arunpravin 
---
  drivers/gpu/drm/Makefile  |   2 +-
  drivers/gpu/drm/drm_buddy.c   | 521 ++
  drivers/gpu/drm/drm_drv.c |   3 +
  drivers/gpu/drm/i915/Makefile |   1 -
  drivers/gpu/drm/i915/i915_buddy.c | 466 
  drivers/gpu/drm/i915/i915_buddy.h | 143 -
  drivers/gpu/drm/i915/i915_module.c|   3 -
  drivers/gpu/drm/i915/i915_scatterlist.c   |  11 +-
  drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  33 +-
  drivers/gpu/drm/i915/i915_ttm_buddy_manager.h |   4 +-
  include/drm/drm_buddy.h   | 153 +
  11 files changed, 702 insertions(+), 638 deletions(-)
  create mode 100644 drivers/gpu/drm/drm_buddy.c
  delete mode 100644 drivers/gpu/drm/i915/i915_buddy.c
  delete mode 100644 drivers/gpu/drm/i915/i915_buddy.h
  create mode 100644 include/drm/drm_buddy.h

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 0dff40bb863c..dc61e91a3154 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -18,7 +18,7 @@ drm-y   :=drm_aperture.o drm_auth.o drm_cache.o \
drm_dumb_buffers.o drm_mode_config.o drm_vblank.o \
drm_syncobj.o drm_lease.o drm_writeback.o drm_client.o \
drm_client_modeset.o drm_atomic_uapi.o drm_hdcp.o \
-   drm_managed.o drm_vblank_work.o
+   drm_managed.o drm_vblank_work.o drm_buddy.o
  
  drm-$(CONFIG_DRM_LEGACY) += drm_agpsupport.o drm_bufs.o drm_context.o drm_dma.o \

drm_legacy_misc.o drm_lock.o drm_memory.o 
drm_scatter.o \
diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
new file mode 100644
index ..39eb1d224bec
--- /dev/null
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -0,0 +1,521 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include 
+#include 
+
+#include 
+
+static struct kmem_cache *slab_blocks;
+
+static struct drm_buddy_block *drm_block_alloc(struct drm_buddy_mm *mm,
+  struct drm_buddy_block *parent,
+  unsigned int order,
+  u64 offset)
+{
+   struct drm_buddy_block *block;
+
+   BUG_ON(order > DRM_BUDDY_MAX_ORDER);
+
+   block = kmem_cache_zalloc(slab_blocks, GFP_KERNEL);
+   if (!block)
+   return NULL;
+
+   block->header = offset;
+   block->header |= order;
+   block->parent = parent;
+
+   BUG_ON(block->header & DRM_BUDDY_HEADER_UNUSED);
+   return block;
+}
+
+static void drm_block_free(struct drm_buddy_mm *mm,
+  struct drm_buddy_block *block)
+{
+   kmem_cache_free(slab_blocks, block);
+}
+
+static void mark_allocated(struct drm_buddy_block *block)
+{
+   block->header &= ~DRM_BUDDY_HEADER_STATE;
+   block->header |= DRM_BUDDY_ALLOCATED;
+
+   list_del(&block->link);
+}
+
+static void mark_free(struct drm_buddy_mm *mm,
+ struct drm_buddy_block *block)
+{
+   block->header &= ~DRM_BUDDY_HEADER_STATE;
+   block->header |= DRM_BUDDY_FREE;
+
+   list_add(&block->link,
+&mm->free_list[drm_buddy_block_order(block)]);
+}
+
+static void mark_split(struct drm_buddy_block *block)
+{
+   block->header &= ~DRM_BUDDY_HEADER_STATE;
+   block->header |= DRM_BUDDY_SPLIT;
+
+   list_del(&block->link);
+}
+
+/**
+ * drm_buddy_init - init memory manager
+ *
+ * @mm: DRM buddy manager to initialize
+