date:20221021

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Luben Tuikov

On 2022-10-21 05:12, Yang Yingliang wrote:
> 
> On 2022/10/21 16:36, Greg KH wrote:
>> On Fri, Oct 21, 2022 at 04:24:23PM +0800, Yang Yingliang wrote:
>>> On 2022/10/21 13:37, Greg KH wrote:
 On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
> On 2022-10-20 22:20, Yang Yingliang wrote:
>> The previous discussion link:
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7C26ed7dc8053f4793d54d08dab344731e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019403819761348%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=PD93EC%2FcBmkfSBbdmK8FNtXhqS%2FKmmcByfkx5lqQfpY%3D&reserved=0
> The very first discussion on this was here:
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7C26ed7dc8053f4793d54d08dab344731e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019403819761348%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=k0fTSmAPTnLFCe4zN4z%2FY1Z7CvwO4gR2vgj%2FLH%2FSRRk%3D&reserved=0
>
> Please use this link, and not the that one up there you which quoted 
> above,
> and whose commit description is taken verbatim from the this link.
>
>> kset_register() is currently used in some places without calling
>> kset_put() in error path, because the callers think it should be
>> kset internal thing to do, but the driver core can not know what
>> caller doing with that memory at times. The memory could be freed
>> both in kset_put() and error path of caller, if it is called in
>> kset_register().
> As I explained in the link above, the reason there's
> a memory leak is that one cannot call kset_register() without
> the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
> in this case, i.e. kset_register() fails with -EINVAL.
>
> Thus, the most common usage is something like this:
>
>   kobj_set_name(&kset->kobj, format, ...);
>   kset->kobj.kset = parent_kset;
>   kset->kobj.ktype = ktype;
>   res = kset_register(kset);
>
> So, what is being leaked, is the memory allocated in kobj_set_name(),
> by the common idiom shown above. This needs to be mentioned in
> the documentation, at least, in case, in the future this is absolved
> in kset_register() redesign, etc.
 Based on this, can kset_register() just clean up from itself when an
 error happens?  Ideally that would be the case, as the odds of a kset
 being embedded in a larger structure is probably slim, but we would have
 to search the tree to make sure.
>>> I have search the whole tree, the kset used in bus_register() - patch #3,
>>> kset_create_and_add() - patch #4
>>> __class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and
>>> amdgpu_discovery.c - patch #10
>>> is embedded in a larger structure. In these cases, we can not call
>>> kset_put() in error path in kset_register()
>> Yes you can as the kobject in the kset should NOT be controling the
>> lifespan of those larger objects.
> Read through the code the only leak in this case is the name, so can we 
> free it
> directly in kset_register():
> 
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -844,8 +844,11 @@ int kset_register(struct kset *k)
> 
>      kset_init(k);
>      err = kobject_add_internal(&k->kobj);
> -   if (err)
> +   if (err) {
> +   kfree_const(k->kobj.name);
> +   k->kobj.name = NULL;
>      return err;
> +   }
>      kobject_uevent(&k->kobj, KOBJ_ADD);
>      return 0;
>   }

This may work, but absolutely needs to be documented since we don't
exactly know how the name was allocated by the caller! FWIW, the caller
may have set the name pointer to point to a static array of strings...

> or unset ktype of kobject, then call kset_put():
> 
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -844,8 +844,11 @@ int kset_register(struct kset *k)
> 
>      kset_init(k);
>      err = kobject_add_internal(&k->kobj);
> -   if (err)
> +   if (err) {
> +   k->kobj.ktype = NULL;
> +   kset_put(k);
>      return err;
> +   }
>      kobject_uevent(&k->kobj, KOBJ_ADD);
>      return 0;
>   }

That's a no. You shouldn't set the ktype to NULL--maybe the caller is relying 
on it...

Regards,
Luben

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Luben Tuikov

On 2022-10-21 05:56, Yang Yingliang wrote:
> 
> On 2022/10/21 17:08, Luben Tuikov wrote:
>> On 2022-10-21 04:59, Yang Yingliang wrote:
>>> On 2022/10/21 16:36, Greg KH wrote:
 On Fri, Oct 21, 2022 at 04:24:23PM +0800, Yang Yingliang wrote:
> On 2022/10/21 13:37, Greg KH wrote:
>> On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
>>> On 2022-10-20 22:20, Yang Yingliang wrote:
 The previous discussion link:
 https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7C2597f1097c204be54c7c08dab34a8654%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019429914730071%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=NNVCtbTxI5uzxxJA9mKvnsy8d3jyudtl1u4CTcm3tsU%3D&reserved=0
>>> The very first discussion on this was here:
>>>
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7C2597f1097c204be54c7c08dab34a8654%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019429914886316%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=ByCQk0qGktyyoNQg8IFj5AGxmaeWOXnbIA4rFnX%2B6%2BA%3D&reserved=0
>>>
>>> Please use this link, and not the that one up there you which quoted 
>>> above,
>>> and whose commit description is taken verbatim from the this link.
>>>
 kset_register() is currently used in some places without calling
 kset_put() in error path, because the callers think it should be
 kset internal thing to do, but the driver core can not know what
 caller doing with that memory at times. The memory could be freed
 both in kset_put() and error path of caller, if it is called in
 kset_register().
>>> As I explained in the link above, the reason there's
>>> a memory leak is that one cannot call kset_register() without
>>> the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
>>> in this case, i.e. kset_register() fails with -EINVAL.
>>>
>>> Thus, the most common usage is something like this:
>>>
>>> kobj_set_name(&kset->kobj, format, ...);
>>> kset->kobj.kset = parent_kset;
>>> kset->kobj.ktype = ktype;
>>> res = kset_register(kset);
>>>
>>> So, what is being leaked, is the memory allocated in kobj_set_name(),
>>> by the common idiom shown above. This needs to be mentioned in
>>> the documentation, at least, in case, in the future this is absolved
>>> in kset_register() redesign, etc.
>> Based on this, can kset_register() just clean up from itself when an
>> error happens?  Ideally that would be the case, as the odds of a kset
>> being embedded in a larger structure is probably slim, but we would have
>> to search the tree to make sure.
> I have search the whole tree, the kset used in bus_register() - patch #3,
> kset_create_and_add() - patch #4
> __class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and
> amdgpu_discovery.c - patch #10
> is embedded in a larger structure. In these cases, we can not call
> kset_put() in error path in kset_register()
 Yes you can as the kobject in the kset should NOT be controling the
 lifespan of those larger objects.

 If it is, please point out the call chain here as I don't think that
 should be possible.

 Note all of this is a mess because the kobject name stuff was added much
 later, after the driver model had been created and running for a while.
 We missed this error path when adding the dynamic kobject name logic,
 thank for looking into this.

 If you could test the patch posted with your error injection systems,
 that could make this all much simpler to solve.
>>> The patch posted by Luben will cause double free in some cases.
>> Yes, I figured this out in the other email and posted the scenario Greg
>> was asking about.
>>
>> But I believe the question still stands if we can do kset_put()
>> after a *failed* kset_register(), namely if more is being done than
>> necessary, which is just to free the memory allocated by
>> kobject_set_name().
> The name memory is allocated in kobject_set_name() in caller,  and I 
> think caller
> free the memory that it allocated is reasonable, it's weird that some 
> callers allocate
> some memory and use function (kset_register) failed, then it free the 
> memory allocated
> in callers,  I think use kset_put()/kfree_const(name) in caller seems 
> more reasonable.

kset_put() would work only in implementations, such as amdgpu_discovery.c,
where the ktype.release is defined and it frees the embedding object in
which the kset is embedded.

Depending on the implemen

Re: [PATCH 1/2] drm/amdkfd: introduce dummy cache info for property asic

2022-10-21 Thread Felix Kuehling

On 2022-10-21 09:05, Liang, Prike wrote:

[Public]

-Original Message-
From: Kuehling, Felix 
Sent: Friday, October 21, 2022 1:11 PM
To: Liang, Prike ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Zhang, Yifan ; 
Huang, Ray ; Liu, Aaron 
Subject: Re: [PATCH 1/2] drm/amdkfd: introduce dummy cache info for property 
asic

Am 2022-10-20 um 21:50 schrieb Liang, Prike:

[Public]

-Original Message-
From: Kuehling, Felix 
Sent: Friday, October 21, 2022 12:03 AM
To: Liang, Prike ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Zhang, Yifan
; Huang, Ray ; Liu, Aaron

Subject: Re: [PATCH 1/2] drm/amdkfd: introduce dummy cache info for
property asic

Am 2022-10-20 um 05:15 schrieb Prike Liang:

This dummy cache info will enable kfd base function support.

Signed-off-by: Prike Liang 
---
drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 55 +--
1 file changed, 52 insertions(+), 3 deletions(-)

[snip]

@@ -1528,7 +1574,10 @@ static int kfd_fill_gpu_cache_info(struct kfd_dev *kdev,
kfd_fill_gpu_cache_info_from_gfx_config(kdev, 
pcache_info);
break;
default:
- return -EINVAL;
+ pcache_info = dummy_cache_info;
+ num_of_cache_types = ARRAY_SIZE(dummy_cache_info);
+ pr_warn("dummy cache info is used temporarily and real cache 
info need update later.\n");
+ break;

Could we make this return an error if the amdgpu.exp_hw_support module 
parameter is not set?

Regards,
 Felix

[Prike] It's fine to protect this dummy info by checking the parameter 
amdgpu_exp_hw_support, but it may not friendly to end user by adding the 
parameter and some guys will still report KFD not enabled for this parameter 
setting problem. The original idea is the end user will not aware the dummy 
cache info and only alert the warning message to developer.

I thought the intention was to simplify bring-up but make sure that valid cache 
info is available by the time a chip goes into production.
Therefore, normal end users should never need to set the amdgpu_exp_hw_support 
option. I think you're saying that we would go to production with dummy info. 
That seems like a bad idea to me.

Regards,
Felix

[Prike]  Sorry for the confusion. In fact, this dummy cache info only used 
before production and meanwhile needn't require any parameter setting for CQE 
do KFD test. Anyway if you still have concern on this solution I will add the 
condition of checking amdgpu_exp_hw_support.

The idea to control this with a module parameter was to cause a more 
obvious failure if we don't have correct cache info before going to 
production. Just a warning in the log file is too easy to miss or 
ignore. Of course, if QA gets into the habit of testing with 
amdgpu_exp_hw_support, then this may not solve the problem. You need to 
have at least one test pass without amdgpu_exp_hw_support to catch 
missing cache info.

Regards,
  Felix

Re: [PATCH] drm/amdkfd: correct the cache info for gfx1036

2022-10-21 Thread Alex Deucher

It looks like this patch never landed.

Alex

On Tue, Oct 11, 2022 at 9:48 PM Zhang, Yifan  wrote:
>
> [Public]
>
>
>
> This patch is
>
>
>
> Reviewed-by: Yifan Zhang 
>
>
>
> From: Zhang, Jesse(Jie) 
> Sent: Tuesday, October 11, 2022 1:23 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Zhang, Yifan 
> ; Kuehling, Felix 
> Subject: RE: [PATCH] drm/amdkfd: correct the cache info for gfx1036
>
>
>
> [AMD Official Use Only - General]
>
>
>
>
>
>correct the cache information for gfx1036
>
>
>
> Signed-off-by: Yifan Zhang yifan1.zh...@amd.com
>
>
>
> Signed-off-by: jie1zhan jesse.zh...@amd.com
>
> Change-Id: I60e754737057c144e69a6511ba6ddfca472ca7a1
>
>
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>
> index 477a30981c1b..d25ac9cbe5b2 100644
>
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>
> @@ -795,6 +795,54 @@ static struct kfd_gpu_cache_info 
> yellow_carp_cache_info[] = {
>
> },
>
> };
>
>
>
> +static struct kfd_gpu_cache_info gc_10_3_6_cache_info[] = {
>
> +   {
>
> +   /* TCP L1 Cache per CU */
>
> +   .cache_size = 16,
>
> +   .cache_level = 1,
>
> +   .flags = (CRAT_CACHE_FLAGS_ENABLED |
>
> +   CRAT_CACHE_FLAGS_DATA_CACHE |
>
> +   CRAT_CACHE_FLAGS_SIMD_CACHE),
>
> +   .num_cu_shared = 1,
>
> +   },
>
> +   {
>
> +   /* Scalar L1 Instruction Cache per SQC */
>
> +   .cache_size = 32,
>
> +   .cache_level = 1,
>
> +   .flags = (CRAT_CACHE_FLAGS_ENABLED |
>
> +   CRAT_CACHE_FLAGS_INST_CACHE |
>
> +   CRAT_CACHE_FLAGS_SIMD_CACHE),
>
> +   .num_cu_shared = 2,
>
> +   },
>
> +   {
>
> +   /* Scalar L1 Data Cache per SQC */
>
> +   .cache_size = 16,
>
> +   .cache_level = 1,
>
> +   .flags = (CRAT_CACHE_FLAGS_ENABLED |
>
> +   CRAT_CACHE_FLAGS_DATA_CACHE |
>
> +   CRAT_CACHE_FLAGS_SIMD_CACHE),
>
> +   .num_cu_shared = 2,
>
> +   },
>
> +   {
>
> +   /* GL1 Data Cache per SA */
>
> +   .cache_size = 128,
>
> +   .cache_level = 1,
>
> +   .flags = (CRAT_CACHE_FLAGS_ENABLED |
>
> +   CRAT_CACHE_FLAGS_DATA_CACHE |
>
> +   CRAT_CACHE_FLAGS_SIMD_CACHE),
>
> +   .num_cu_shared = 2,
>
> +   },
>
> +   {
>
> +   /* L2 Data Cache per GPU (Total Tex Cache) */
>
> +   .cache_size = 256,
>
> +   .cache_level = 2,
>
> +   .flags = (CRAT_CACHE_FLAGS_ENABLED |
>
> +   CRAT_CACHE_FLAGS_DATA_CACHE |
>
> +   CRAT_CACHE_FLAGS_SIMD_CACHE),
>
> +   .num_cu_shared = 2,
>
> +   },
>
> +};
>
> +
>
> static void kfd_populated_cu_info_cpu(struct kfd_topology_device *dev,
>
> struct crat_subtype_computeunit *cu)
>
> {
>
> @@ -1514,11 +1562,14 @@ static int kfd_fill_gpu_cache_info(struct kfd_dev 
> *kdev,
>
> num_of_cache_types = 
> ARRAY_SIZE(beige_goby_cache_info);
>
> break;
>
> case IP_VERSION(10, 3, 3):
>
> -   case IP_VERSION(10, 3, 6): /* TODO: Double check these on 
> production silicon */
>
> case IP_VERSION(10, 3, 7): /* TODO: Double check these on 
> production silicon */
>
> pcache_info = yellow_carp_cache_info;
>
> num_of_cache_types = 
> ARRAY_SIZE(yellow_carp_cache_info);
>
> break;
>
> +   case IP_VERSION(10, 3, 6):
>
> +   pcache_info = gc_10_3_6_cache_info;
>
> +   num_of_cache_types = ARRAY_SIZE(gc_10_3_6_cache_info);
>
> +   break;
>
> case IP_VERSION(11, 0, 0):

[PATCH] drm/amd/display: don't print messages that contain %f in dml

2022-10-21 Thread Hamza Mahfooz

Unfortunately, printk() doesn't currently support the printing of %f
entries. So, print statements that contain "%f" should be removed.
However, since DC is used on other OSes that can still benefit from the
additional debugging information, we should instead remove the
problematic print statements at compile time.

Reported-by: Jim Cromie 
Signed-off-by: Hamza Mahfooz 
---
 drivers/gpu/drm/amd/display/include/logger_types.h | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/include/logger_types.h 
b/drivers/gpu/drm/amd/display/include/logger_types.h
index 3bf08a60c45c..f80630adb5f0 100644
--- a/drivers/gpu/drm/amd/display/include/logger_types.h
+++ b/drivers/gpu/drm/amd/display/include/logger_types.h
@@ -30,6 +30,12 @@
 
 #define MAX_NAME_LEN 32
 
+#define __DC_LOG_IGNORE_FLOATS(fmt, args...)   \
+do {   \
+   if (!strstr((fmt), "%f"))   \
+   pr_debug(fmt, ##args);  \
+} while (0)
+
 #define DC_LOG_ERROR(...) DRM_ERROR(__VA_ARGS__)
 #define DC_LOG_WARNING(...) DRM_WARN(__VA_ARGS__)
 #define DC_LOG_DEBUG(...) DRM_DEBUG_KMS(__VA_ARGS__)
@@ -48,7 +54,8 @@
 #define DC_LOG_MST(...) DRM_DEBUG_KMS(__VA_ARGS__)
 #define DC_LOG_SCALER(...) pr_debug("[SCALER]:"__VA_ARGS__)
 #define DC_LOG_BIOS(...) pr_debug("[BIOS]:"__VA_ARGS__)
-#define DC_LOG_BANDWIDTH_CALCS(...) pr_debug("[BANDWIDTH_CALCS]:"__VA_ARGS__)
+#define DC_LOG_BANDWIDTH_CALCS(args...) \
+   __DC_LOG_IGNORE_FLOATS("[BANDWIDTH_CALCS]:" args)
 #define DC_LOG_BANDWIDTH_VALIDATION(...) DRM_DEBUG_KMS(__VA_ARGS__)
 #define DC_LOG_I2C_AUX(...) DRM_DEBUG_KMS(__VA_ARGS__)
 #define DC_LOG_SYNC(...) DRM_DEBUG_KMS(__VA_ARGS__)
@@ -57,7 +64,7 @@
 #define DC_LOG_DETECTION_EDID_PARSER(...) DRM_DEBUG_KMS(__VA_ARGS__)
 #define DC_LOG_DETECTION_DP_CAPS(...) DRM_DEBUG_KMS(__VA_ARGS__)
 #define DC_LOG_RESOURCE(...) DRM_DEBUG_KMS(__VA_ARGS__)
-#define DC_LOG_DML(...) pr_debug("[DML]:"__VA_ARGS__)
+#define DC_LOG_DML(args...) __DC_LOG_IGNORE_FLOATS("[DML]:" args)
 #define DC_LOG_EVENT_MODE_SET(...) DRM_DEBUG_KMS(__VA_ARGS__)
 #define DC_LOG_EVENT_DETECTION(...) DRM_DEBUG_KMS(__VA_ARGS__)
 #define DC_LOG_EVENT_LINK_TRAINING(...) DRM_DEBUG_KMS(__VA_ARGS__)
-- 
2.38.0

Re: [PATCH] drm/amd/display: Remove duplicate code for DCN314 DML calculation

2022-10-21 Thread Harry Wentland




On 2022-10-20 18:10, Rafael Mendonca wrote:
> This is an extension of commit fd3bc691fc7b ("drm/amd/display: Remove
> duplicate code across dcn30 and dcn31"), which removed duplicate code for
> the function CalculateBytePerPixelAnd256BBlockSizes() across dcn30 and
> dcn31. At the time the aforementioned commit was introduced, support for
> DCN 3.1.4 was still not merged. Thus, this deletes duplicate code for
> CalculateBytePerPixelAnd256BBlockSizes(), that was introduced later in
> DCN314, in favor of using the respective functionality from
> 'display_mode_vba_30.h'.
> 
> Additionally, by doing that, we also fix a duplicate argument issue
> reported by coccinelle in 'display_rq_dlg_calc_314.c':
> 
>   static bool CalculateBytePerPixelAnd256BBlockSizes(...) {
> ...
> } else if (SourcePixelFormat == dm_444_16 || SourcePixelFormat == 
> dm_444_16) {
> ...
>   }
> 

A lot of the uglyness of this code and some of the duplicate nature
of it are due to the fact that this comes directly from the HW
designers and is consumed as HW gospel. Years ago we tried to write
our own beautiful code to implement the HW bandwidth constraints and
calculations. This proved problematic as we had an argument with
HW designers each time there was a bug and first had to try to
prove that our own code was good. Consuming their code (more-or-less)
as-is short circuits any of these arguments and has lead to a
more stable driver, even with HW where the bandwidth calculations
have become more and more complex.

We do want this code to be kosher for the kernel but beyond that
we want to refactor this code as little as possible for the
above-stated reasons.

Note the "NOTE" at the top of the file regarding this.

As for the specifics of this patch, I'll leave that to Siqueira,
since he's been a lot deeper into this code than I have.

Harry

> Signed-off-by: Rafael Mendonca 
> ---
>  .../dc/dml/dcn314/display_mode_vba_314.c  | 106 +-
>  .../dc/dml/dcn314/display_rq_dlg_calc_314.c   |  91 +--
>  2 files changed, 6 insertions(+), 191 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c 
> b/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
> index 0d12fd079cd6..6e43cd21a7d3 100644
> --- a/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
> +++ b/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
> @@ -32,6 +32,7 @@
>  #include "../display_mode_lib.h"
>  #include "display_mode_vba_314.h"
>  #include "../dml_inline_defs.h"
> +#include "dml/dcn30/display_mode_vba_30.h"
>  
>  /*
>   * NOTE:
> @@ -90,17 +91,6 @@ typedef struct {
>  #define BPP_INVALID 0
>  #define BPP_BLENDED_PIPE 0x
>  
> -static bool CalculateBytePerPixelAnd256BBlockSizes(
> - enum source_format_class SourcePixelFormat,
> - enum dm_swizzle_mode SurfaceTiling,
> - unsigned int *BytePerPixelY,
> - unsigned int *BytePerPixelC,
> - double *BytePerPixelDETY,
> - double *BytePerPixelDETC,
> - unsigned int *BlockHeight256BytesY,
> - unsigned int *BlockHeight256BytesC,
> - unsigned int *BlockWidth256BytesY,
> - unsigned int *BlockWidth256BytesC);
>  static void DisplayPipeConfiguration(struct display_mode_lib *mode_lib);
>  static void 
> DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerformanceCalculation(struct
>  display_mode_lib *mode_lib);
>  static unsigned int dscceComputeDelay(
> @@ -2178,7 +2168,7 @@ static void 
> DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerforman
>   DTRACE("   return_bus_bw  = %f", v->ReturnBW);
>  
>   for (k = 0; k < v->NumberOfActivePlanes; ++k) {
> - CalculateBytePerPixelAnd256BBlockSizes(
> + dml30_CalculateBytePerPixelAnd256BBlockSizes(
>   v->SourcePixelFormat[k],
>   v->SurfaceTiling[k],
>   &v->BytePerPixelY[k],
> @@ -3317,7 +3307,7 @@ static void DisplayPipeConfiguration(struct 
> display_mode_lib *mode_lib)
>  
>   for (k = 0; k < v->NumberOfActivePlanes; ++k) {
>  
> - CalculateBytePerPixelAnd256BBlockSizes(
> + dml30_CalculateBytePerPixelAnd256BBlockSizes(
>   v->SourcePixelFormat[k],
>   v->SurfaceTiling[k],
>   &BytePerPixY[k],
> @@ -3371,94 +3361,6 @@ static void DisplayPipeConfiguration(struct 
> display_mode_lib *mode_lib)
>   &dummysinglestring);
>  }
>  
> -static bool CalculateBytePerPixelAnd256BBlockSizes(
> - enum source_format_class SourcePixelFormat,
> - enum dm_swizzle_mode SurfaceTiling,
> - unsigned int *BytePerPixelY,
> - unsigned int *BytePerPixelC,
> - double *BytePerPixelDETY,
> - double *BytePerPixelDETC,
> -

Re: [PATCH v2] drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume

2022-10-21 Thread Alex Deucher

On Fri, Oct 21, 2022 at 10:47 AM Prike Liang  wrote:
>
> In the S2idle suspend/resume phase the gfxoff is keeping functional so
> some IP blocks will be likely to reinitialize at gfxoff entry and that
> will result in failing to program GC registers.Therefore, let disallow
> gfxoff until AMDGPU IPs reinitialized completely.
>
> Signed-off-by: Prike Liang 
> ---
> -v2: Move the operation of exiting gfxoff from smu to higer layer in 
> amdgpu_device.c.
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5b8362727226..36c44625932e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3210,6 +3210,12 @@ static int amdgpu_device_ip_resume_phase2(struct 
> amdgpu_device *adev)
> return r;
> }
> adev->ip_blocks[i].status.hw = true;
> +
> +   if (adev->in_s0ix && adev->ip_blocks[i].version->type == 
> AMD_IP_BLOCK_TYPE_SMC) {

Add a comment here something like:
/* disable gfxoff for IP resume.  gfxoff is re-enabled in
amdgpu_device_resume() after IP resume */

> +   amdgpu_gfx_off_ctrl(adev, false);
> +   DRM_DEBUG("will disable gfxoff for re-initializing 
> other blocks\n");
> +   }
> +
> }
>
> return 0;
> @@ -4185,6 +4191,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> fbcon)
> /* Make sure IB tests flushed */
> flush_delayed_work(&adev->delayed_init_work);
>
> +   if (adev->in_s0ix) {

Add a comment here something like:
/* re-enable gfxoff after IP resume.  This re-enables gfxoff after it
was disabled for IP resume in amdgpu_device_ip_resume_phase2() */

This those comments added, the patch is:
Acked-by: Alex Deucher 

> +   amdgpu_gfx_off_ctrl(adev, true);
> +   DRM_DEBUG("will enable gfxoff for the mission mode\n");
> +   }
> if (fbcon)
> 
> drm_fb_helper_set_suspend_unlocked(adev_to_drm(adev)->fb_helper, false);
>
> --
> 2.25.1
>

[PATCH v2] drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume

2022-10-21 Thread Prike Liang

In the S2idle suspend/resume phase the gfxoff is keeping functional so
some IP blocks will be likely to reinitialize at gfxoff entry and that
will result in failing to program GC registers.Therefore, let disallow
gfxoff until AMDGPU IPs reinitialized completely.

Signed-off-by: Prike Liang 
---
-v2: Move the operation of exiting gfxoff from smu to higer layer in 
amdgpu_device.c.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5b8362727226..36c44625932e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3210,6 +3210,12 @@ static int amdgpu_device_ip_resume_phase2(struct 
amdgpu_device *adev)
return r;
}
adev->ip_blocks[i].status.hw = true;
+
+   if (adev->in_s0ix && adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_SMC) {
+   amdgpu_gfx_off_ctrl(adev, false);
+   DRM_DEBUG("will disable gfxoff for re-initializing 
other blocks\n");
+   }
+
}
 
return 0;
@@ -4185,6 +4191,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
/* Make sure IB tests flushed */
flush_delayed_work(&adev->delayed_init_work);
 
+   if (adev->in_s0ix) {
+   amdgpu_gfx_off_ctrl(adev, true);
+   DRM_DEBUG("will enable gfxoff for the mission mode\n");
+   }
if (fbcon)

drm_fb_helper_set_suspend_unlocked(adev_to_drm(adev)->fb_helper, false);
 
-- 
2.25.1

Re: [PATCH] drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume

2022-10-21 Thread Alex Deucher

On Fri, Oct 21, 2022 at 10:10 AM Liang, Prike  wrote:
>
> [Public]
>
> -Original Message-
> From: Alex Deucher 
> Sent: Friday, October 21, 2022 9:39 PM
> To: Liang, Prike 
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> ; Huang, Ray 
> Subject: Re: [PATCH] drm/amdgpu: disallow gfxoff until GC IP blocks complete 
> s2idle resume
>
> On Thu, Oct 20, 2022 at 10:30 PM Prike Liang  wrote:
> >
> > In the S2idle suspend/resume phase the gfxoff is keeping functional so
> > some IP blocks will be likely to reinitialize at gfxoff entry and that
> > will result in failing to program GC registers.Therefore, let disallow
> > gfxoff until AMDGPU IPs reinitialized completely.
> >
> > Signed-off-by: Prike Liang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
> > drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 5 +
> >  2 files changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index e0445e8cc342..1624ed15fbc4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4185,6 +4185,10 @@ int amdgpu_device_resume(struct drm_device *dev, 
> > bool fbcon)
> > /* Make sure IB tests flushed */
> > flush_delayed_work(&adev->delayed_init_work);
> >
> > +   if (adev->in_s0ix) {
> > +   amdgpu_gfx_off_ctrl(adev, true);
> > +   DRM_DEBUG("will enable gfxoff for the mission mode\n");
> > +   }
> > if (fbcon)
> >
> > drm_fb_helper_set_suspend_unlocked(adev_to_drm(adev)->fb_helper,
> > false);
> >
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > index 4fe75dd2b329..3948dc5b1d6a 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > @@ -1664,6 +1664,11 @@ static int smu_resume(void *handle)
> >
> > dev_info(adev->dev, "SMU is resumed successfully!\n");
> >
> > +   if (adev->in_s0ix) {
> > +   amdgpu_gfx_off_ctrl(adev, false);
> > +   dev_dbg(adev->dev, "will disable gfxoff for re-initializing 
> > other blocks\n");
> > +   }
> > +
>
> I think it would be better to put this in amdgpu_device.c so it's clear where 
> its match is.  Also for raven based boards this will get missed because they 
> still use the powerplay power code.
>
> Alex
>
> [Prike] I miss this amdgpu_gfx_off_ctrl() is a generic upper layer function 
> but that should also work on Rave series, since on the Rave series will use 
> another message PPSMC_MSG_GpuChangeState exit gfxoff. But it makes sense 
> unify the operation of exiting gfxoff at an upper layer in amdgpu_device.c 
> and I will update it at patch v2.
>

Ok.  If you can move it into amdgpu_device.c that would be great, but
if not, either way, please add comments to these functions noting
where their match is and why this is necessary.

Thanks!

Alex

> Thanks,
> Prike
> > return 0;
> >  }
> >
> > --
> > 2.25.1
> >

RE: [PATCH] drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume

2022-10-21 Thread Liang, Prike

[Public]

-Original Message-
From: Alex Deucher 
Sent: Friday, October 21, 2022 9:39 PM
To: Liang, Prike 
Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Huang, Ray 
Subject: Re: [PATCH] drm/amdgpu: disallow gfxoff until GC IP blocks complete 
s2idle resume

On Thu, Oct 20, 2022 at 10:30 PM Prike Liang  wrote:
>
> In the S2idle suspend/resume phase the gfxoff is keeping functional so
> some IP blocks will be likely to reinitialize at gfxoff entry and that
> will result in failing to program GC registers.Therefore, let disallow
> gfxoff until AMDGPU IPs reinitialized completely.
>
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
> drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 5 +
>  2 files changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e0445e8cc342..1624ed15fbc4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4185,6 +4185,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> fbcon)
> /* Make sure IB tests flushed */
> flush_delayed_work(&adev->delayed_init_work);
>
> +   if (adev->in_s0ix) {
> +   amdgpu_gfx_off_ctrl(adev, true);
> +   DRM_DEBUG("will enable gfxoff for the mission mode\n");
> +   }
> if (fbcon)
>
> drm_fb_helper_set_suspend_unlocked(adev_to_drm(adev)->fb_helper,
> false);
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 4fe75dd2b329..3948dc5b1d6a 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -1664,6 +1664,11 @@ static int smu_resume(void *handle)
>
> dev_info(adev->dev, "SMU is resumed successfully!\n");
>
> +   if (adev->in_s0ix) {
> +   amdgpu_gfx_off_ctrl(adev, false);
> +   dev_dbg(adev->dev, "will disable gfxoff for re-initializing 
> other blocks\n");
> +   }
> +

I think it would be better to put this in amdgpu_device.c so it's clear where 
its match is.  Also for raven based boards this will get missed because they 
still use the powerplay power code.

Alex

[Prike] I miss this amdgpu_gfx_off_ctrl() is a generic upper layer function but 
that should also work on Rave series, since on the Rave series will use another 
message PPSMC_MSG_GpuChangeState exit gfxoff. But it makes sense unify the 
operation of exiting gfxoff at an upper layer in amdgpu_device.c and I will 
update it at patch v2.

Thanks,
Prike
> return 0;
>  }
>
> --
> 2.25.1
>

Re: [PATCH] drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume

2022-10-21 Thread Alex Deucher

On Thu, Oct 20, 2022 at 10:30 PM Prike Liang  wrote:
>
> In the S2idle suspend/resume phase the gfxoff is keeping functional so
> some IP blocks will be likely to reinitialize at gfxoff entry and that
> will result in failing to program GC registers.Therefore, let disallow
> gfxoff until AMDGPU IPs reinitialized completely.
>
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 5 +
>  2 files changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e0445e8cc342..1624ed15fbc4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4185,6 +4185,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> fbcon)
> /* Make sure IB tests flushed */
> flush_delayed_work(&adev->delayed_init_work);
>
> +   if (adev->in_s0ix) {
> +   amdgpu_gfx_off_ctrl(adev, true);
> +   DRM_DEBUG("will enable gfxoff for the mission mode\n");
> +   }
> if (fbcon)
> 
> drm_fb_helper_set_suspend_unlocked(adev_to_drm(adev)->fb_helper, false);
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 4fe75dd2b329..3948dc5b1d6a 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -1664,6 +1664,11 @@ static int smu_resume(void *handle)
>
> dev_info(adev->dev, "SMU is resumed successfully!\n");
>
> +   if (adev->in_s0ix) {
> +   amdgpu_gfx_off_ctrl(adev, false);
> +   dev_dbg(adev->dev, "will disable gfxoff for re-initializing 
> other blocks\n");
> +   }
> +

I think it would be better to put this in amdgpu_device.c so it's
clear where its match is.  Also for raven based boards this will get
missed because they still use the powerplay power code.

Alex

> return 0;
>  }
>
> --
> 2.25.1
>

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Yang Yingliang




On 2022/10/21 16:41, Luben Tuikov wrote:

On 2022-10-21 04:24, Luben Tuikov wrote:

On 2022-10-21 04:18, Greg KH wrote:

On Fri, Oct 21, 2022 at 03:55:18AM -0400, Luben Tuikov wrote:

On 2022-10-21 01:37, Greg KH wrote:

On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:

On 2022-10-20 22:20, Yang Yingliang wrote:

The previous discussion link:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7Cd41da3fd6449492d01f808dab33cdb75%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019371236833115%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=C%2Bj1THkHpzVGks5eqB%2Fm%2FPAkMRohR7CYvRnOCqUqdcM%3D&reserved=0

The very first discussion on this was here:

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7Cd41da3fd6449492d01f808dab33cdb75%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019371236833115%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=pSR10abmK8nAMvKSezqWC0SPUBL4qEwtCCizyIKW7Dc%3D&reserved=0

Please use this link, and not the that one up there you which quoted above,
and whose commit description is taken verbatim from the this link.


kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

As I explained in the link above, the reason there's
a memory leak is that one cannot call kset_register() without
the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
in this case, i.e. kset_register() fails with -EINVAL.

Thus, the most common usage is something like this:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

So, what is being leaked, is the memory allocated in kobj_set_name(),
by the common idiom shown above. This needs to be mentioned in
the documentation, at least, in case, in the future this is absolved
in kset_register() redesign, etc.

Based on this, can kset_register() just clean up from itself when an
error happens?  Ideally that would be the case, as the odds of a kset
being embedded in a larger structure is probably slim, but we would have
to search the tree to make sure.

Looking at kset_register(), we can add kset_put() in the error path,
when kobject_add_internal(&kset->kobj) fails.

See the attached patch. It needs to be tested with the same error injection
as Yang has been doing.

Now, struct kset is being embedded in larger structs--see amdgpu_discovery.c
starting at line 575. If you're on an AMD system, it gets you the tree
structure you'll see when you run "tree 
/sys/class/drm/card0/device/ip_discovery/".
That shouldn't be a problem though.

Yes, that shouldn't be an issue as the kobject embedded in a kset is
ONLY for that kset itself, the kset structure should not be controling
the lifespan of the object it is embedded in, right?

Yes, and it doesn't. It only does a kobject_get(parent) and kobject_put(parent).
So that's fine and natural.

Yang, do you want to try the patch in my previous email in this thread, since 
you've
got the error injection set up already?

I spoke too soon. I believe you're onto something, because looking at the idiom:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

The ktype defines a release method, which frees the larger struct the kset is 
embedded in.
And this would result in double free, for instance in the amdgpu_discovery.c 
code, if
kset_put() is called after kset_register() fails, since we kzalloc the larger 
object
just before and kfree it on error just after. Ideally, we'd only "revert" the 
actions
done by kobj_set_name(), as there's some error recovery on create_dir() in 
kobject_add_internal().

So, we cannot do this business with the kset_put() on error from 
kset_register(), after all.
Not sure how this wasn't caught in Yang's testing--the kernel should've 
complained.
I have already tried the change that in your posted patch when I was 
debugging this issue
before sent these fixes, because it may lead double free in some cases, 
and I had mentioned

it in this mail:

https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/#m68eade1993859dfc6c3d14d35f988d35a32ef837

Thanks,
Yang


Regards,
Luben

.

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Yang Yingliang




On 2022/10/21 16:36, Greg KH wrote:

On Fri, Oct 21, 2022 at 04:24:23PM +0800, Yang Yingliang wrote:

On 2022/10/21 13:37, Greg KH wrote:

On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:

On 2022-10-20 22:20, Yang Yingliang wrote:

The previous discussion link:
https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/

The very first discussion on this was here:

https://www.spinics.net/lists/dri-devel/msg368077.html

Please use this link, and not the that one up there you which quoted above,
and whose commit description is taken verbatim from the this link.


kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

As I explained in the link above, the reason there's
a memory leak is that one cannot call kset_register() without
the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
in this case, i.e. kset_register() fails with -EINVAL.

Thus, the most common usage is something like this:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

So, what is being leaked, is the memory allocated in kobj_set_name(),
by the common idiom shown above. This needs to be mentioned in
the documentation, at least, in case, in the future this is absolved
in kset_register() redesign, etc.

Based on this, can kset_register() just clean up from itself when an
error happens?  Ideally that would be the case, as the odds of a kset
being embedded in a larger structure is probably slim, but we would have
to search the tree to make sure.

I have search the whole tree, the kset used in bus_register() - patch #3,
kset_create_and_add() - patch #4
__class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and
amdgpu_discovery.c - patch #10
is embedded in a larger structure. In these cases, we can not call
kset_put() in error path in kset_register()

Yes you can as the kobject in the kset should NOT be controling the
lifespan of those larger objects.

If it is, please point out the call chain here as I don't think that
should be possible.

Note all of this is a mess because the kobject name stuff was added much
later, after the driver model had been created and running for a while.
We missed this error path when adding the dynamic kobject name logic,
thank for looking into this.

If you could test the patch posted with your error injection systems,
that could make this all much simpler to solve.


The patch posted by Luben will cause double free in some cases.


From 71e0a22801c0699f67ea40ed96e0a7d7d9e0f318 Mon Sep 17 00:00:00 2001
From: Luben Tuikov 
Date: Fri, 21 Oct 2022 03:34:21 -0400
Subject: [PATCH] kobject: Add kset_put() if kset_register() fails
X-check-string-leak: v1.0

If kset_register() fails, we call kset_put() before returning the
error. This makes sure that we free memory allocated by kobj_set_name()
for the kset, since kset_register() cannot be called unless the kset has
a name, usually gotten via kobj_set_name(&kset->kobj, format, ...);

Cc: Greg Kroah-Hartman 
Cc: Rafael J. Wysocki 
Cc: Yang Yingliang 
Cc: Linux Kernel Mailing List 
Signed-off-by: Luben Tuikov 
---
 lib/kobject.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/kobject.c b/lib/kobject.c
index a0b2dbfcfa2334..c122b979f2b75e 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -844,8 +844,10 @@ int kset_register(struct kset *k)

 kset_init(k);
 err = kobject_add_internal(&k->kobj);
-    if (err)
+    if (err) {
+        kset_put(k);
     return err;
+    }
 kobject_uevent(&k->kobj, KOBJ_ADD);
 return 0;
 }
--
2.38.0-rc2



thanks,

greg k-h
.

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Yang Yingliang




On 2022/10/21 16:36, Greg KH wrote:

On Fri, Oct 21, 2022 at 04:24:23PM +0800, Yang Yingliang wrote:

On 2022/10/21 13:37, Greg KH wrote:

On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:

On 2022-10-20 22:20, Yang Yingliang wrote:

The previous discussion link:
https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/

The very first discussion on this was here:

https://www.spinics.net/lists/dri-devel/msg368077.html

Please use this link, and not the that one up there you which quoted above,
and whose commit description is taken verbatim from the this link.


kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

As I explained in the link above, the reason there's
a memory leak is that one cannot call kset_register() without
the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
in this case, i.e. kset_register() fails with -EINVAL.

Thus, the most common usage is something like this:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

So, what is being leaked, is the memory allocated in kobj_set_name(),
by the common idiom shown above. This needs to be mentioned in
the documentation, at least, in case, in the future this is absolved
in kset_register() redesign, etc.

Based on this, can kset_register() just clean up from itself when an
error happens?  Ideally that would be the case, as the odds of a kset
being embedded in a larger structure is probably slim, but we would have
to search the tree to make sure.

I have search the whole tree, the kset used in bus_register() - patch #3,
kset_create_and_add() - patch #4
__class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and
amdgpu_discovery.c - patch #10
is embedded in a larger structure. In these cases, we can not call
kset_put() in error path in kset_register()

Yes you can as the kobject in the kset should NOT be controling the
lifespan of those larger objects.
Read through the code the only leak in this case is the name, so can we 
free it

directly in kset_register():

--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -844,8 +844,11 @@ int kset_register(struct kset *k)

    kset_init(k);
    err = kobject_add_internal(&k->kobj);
-   if (err)
+   if (err) {
+   kfree_const(k->kobj.name);
+   k->kobj.name = NULL;
    return err;
+   }
    kobject_uevent(&k->kobj, KOBJ_ADD);
    return 0;
 }

or unset ktype of kobject, then call kset_put():

--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -844,8 +844,11 @@ int kset_register(struct kset *k)

    kset_init(k);
    err = kobject_add_internal(&k->kobj);
-   if (err)
+   if (err) {
+   k->kobj.ktype = NULL;
+   kset_put(k);
    return err;
+   }
    kobject_uevent(&k->kobj, KOBJ_ADD);
    return 0;
 }



If it is, please point out the call chain here as I don't think that
should be possible.

Note all of this is a mess because the kobject name stuff was added much
later, after the driver model had been created and running for a while.
We missed this error path when adding the dynamic kobject name logic,
thank for looking into this.

If you could test the patch posted with your error injection systems,
that could make this all much simpler to solve.

thanks,

greg k-h
.

Re: [PATCH 01/11] kset: fix documentation for kset_register()

2022-10-21 Thread Yang Yingliang




On 2022/10/21 13:34, Luben Tuikov wrote:

On 2022-10-20 22:20, Yang Yingliang wrote:

kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

So make the function documentation more explicit about calling
kset_put() in the error path of caller.

Signed-off-by: Yang Yingliang 
---
  lib/kobject.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/lib/kobject.c b/lib/kobject.c
index a0b2dbfcfa23..6da04353d974 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -834,6 +834,9 @@ EXPORT_SYMBOL_GPL(kobj_sysfs_ops);
  /**
   * kset_register() - Initialize and add a kset.
   * @k: kset.
+ *
+ * If this function returns an error, kset_put() must be called to
+ * properly clean up the memory associated with the object.
   */

And I'd continue the sentence, with " ... with the object,
for instance the memory for the kset.kobj.name when kobj_set_name(&kset.kobj, 
format, ...)
was called before calling kset_register()."
kobject_cleanup() not only frees name, but aslo calls ->release() to 
free another resources.


This makes it clear what we want to make sure is freed, in case of an early 
error
from kset_register().


How about like this:

If this function returns an error, kset_put() must be called to clean up the 
name of
kset object and other memory associated with the object.



Regards,
Luben


  int kset_register(struct kset *k)
  {

.

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Yang Yingliang


Hi,

On 2022/10/21 13:29, Luben Tuikov wrote:

On 2022-10-20 22:20, Yang Yingliang wrote:

The previous discussion link:
https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/

The very first discussion on this was here:

https://www.spinics.net/lists/dri-devel/msg368077.html

Please use this link, and not the that one up there you which quoted above,
and whose commit description is taken verbatim from the this link.
I found this leaks in 
bus_register()/class_register()/kset_create_and_add() at first, and describe
the reason in these patches which is using kobject_set_name() 
description, here is the patches:


https://lore.kernel.org/lkml/20221017014957.156645-1-yangyingli...@huawei.com/T/
https://lore.kernel.org/lkml/20221017031335.1845383-1-yangyingli...@huawei.com/
https://lore.kernel.org/lkml/y0zfpkagqsryz...@kroah.com/T/

And then I found other subsystem also have this problem, so posted the 
fix patches for them

(including qemu_fw_cfg/f2fs/erofs/ocfs2/amdgpu_discovery):

https://www.mail-archive.com/qemu-devel@nongnu.org/msg915553.html
https://lore.kernel.org/linux-f2fs-devel/7908686b-9a7c-b754-d312-d689fc283...@kernel.org/T/#t
https://lore.kernel.org/linux-erofs/20221018073947.693206-1-yangyingli...@huawei.com/
https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/

https://www.spinics.net/lists/dri-devel/msg368092.html
In the amdgpu_discovery patch, I sent a old one which using wrong 
description and you pointer out,

and then I send a v2.

And then the maintainer of ocfs2 has different thought about this, so we 
had a discussion in the link
that I gave out, and Greg suggested me to update kset_register() 
documentation and then put the fix

patches together in one series, so I sent this patchset and use the link.

Thanks,
Yang




kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

As I explained in the link above, the reason there's
a memory leak is that one cannot call kset_register() without
the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
in this case, i.e. kset_register() fails with -EINVAL.

Thus, the most common usage is something like this:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

So, what is being leaked, is the memory allocated in kobj_set_name(),
by the common idiom shown above. This needs to be mentioned in
the documentation, at least, in case, in the future this is absolved
in kset_register() redesign, etc.

Regards,
Luben


So make the function documentation more explicit about calling
kset_put() in the error path of caller first, so that people
have a chance to know what to do here, then fixes this leaks
by calling kset_put() from callers.

Liu Shixin (1):
   ubifs: Fix memory leak in ubifs_sysfs_init()

Yang Yingliang (10):
   kset: fix documentation for kset_register()
   kset: add null pointer check in kset_put()
   bus: fix possible memory leak in bus_register()
   kobject: fix possible memory leak in kset_create_and_add()
   class: fix possible memory leak in __class_register()
   firmware: qemu_fw_cfg: fix possible memory leak in
 fw_cfg_build_symlink()
   f2fs: fix possible memory leak in f2fs_init_sysfs()
   erofs: fix possible memory leak in erofs_init_sysfs()
   ocfs2: possible memory leak in mlog_sys_init()
   drm/amdgpu/discovery: fix possible memory leak

  drivers/base/bus.c| 4 +++-
  drivers/base/class.c  | 6 ++
  drivers/firmware/qemu_fw_cfg.c| 2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 5 +++--
  fs/erofs/sysfs.c  | 4 +++-
  fs/f2fs/sysfs.c   | 4 +++-
  fs/ocfs2/cluster/masklog.c| 7 ++-
  fs/ubifs/sysfs.c  | 2 ++
  include/linux/kobject.h   | 3 ++-
  lib/kobject.c | 5 -
  10 files changed, 33 insertions(+), 9 deletions(-)


.

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Greg KH

On Fri, Oct 21, 2022 at 04:24:23PM +0800, Yang Yingliang wrote:
> 
> On 2022/10/21 13:37, Greg KH wrote:
> > On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
> > > On 2022-10-20 22:20, Yang Yingliang wrote:
> > > > The previous discussion link:
> > > > https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/
> > > The very first discussion on this was here:
> > > 
> > > https://www.spinics.net/lists/dri-devel/msg368077.html
> > > 
> > > Please use this link, and not the that one up there you which quoted 
> > > above,
> > > and whose commit description is taken verbatim from the this link.
> > > 
> > > > kset_register() is currently used in some places without calling
> > > > kset_put() in error path, because the callers think it should be
> > > > kset internal thing to do, but the driver core can not know what
> > > > caller doing with that memory at times. The memory could be freed
> > > > both in kset_put() and error path of caller, if it is called in
> > > > kset_register().
> > > As I explained in the link above, the reason there's
> > > a memory leak is that one cannot call kset_register() without
> > > the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
> > > in this case, i.e. kset_register() fails with -EINVAL.
> > > 
> > > Thus, the most common usage is something like this:
> > > 
> > >   kobj_set_name(&kset->kobj, format, ...);
> > >   kset->kobj.kset = parent_kset;
> > >   kset->kobj.ktype = ktype;
> > >   res = kset_register(kset);
> > > 
> > > So, what is being leaked, is the memory allocated in kobj_set_name(),
> > > by the common idiom shown above. This needs to be mentioned in
> > > the documentation, at least, in case, in the future this is absolved
> > > in kset_register() redesign, etc.
> > Based on this, can kset_register() just clean up from itself when an
> > error happens?  Ideally that would be the case, as the odds of a kset
> > being embedded in a larger structure is probably slim, but we would have
> > to search the tree to make sure.
> I have search the whole tree, the kset used in bus_register() - patch #3,
> kset_create_and_add() - patch #4
> __class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and
> amdgpu_discovery.c - patch #10
> is embedded in a larger structure. In these cases, we can not call
> kset_put() in error path in kset_register()

Yes you can as the kobject in the kset should NOT be controling the
lifespan of those larger objects.

If it is, please point out the call chain here as I don't think that
should be possible.

Note all of this is a mess because the kobject name stuff was added much
later, after the driver model had been created and running for a while.
We missed this error path when adding the dynamic kobject name logic,
thank for looking into this.

If you could test the patch posted with your error injection systems,
that could make this all much simpler to solve.

thanks,

greg k-h

Re: [PATCH 01/11] kset: fix documentation for kset_register()

2022-10-21 Thread Greg KH

On Fri, Oct 21, 2022 at 04:05:18PM +0800, Yang Yingliang wrote:
> 
> On 2022/10/21 13:34, Luben Tuikov wrote:
> > On 2022-10-20 22:20, Yang Yingliang wrote:
> > > kset_register() is currently used in some places without calling
> > > kset_put() in error path, because the callers think it should be
> > > kset internal thing to do, but the driver core can not know what
> > > caller doing with that memory at times. The memory could be freed
> > > both in kset_put() and error path of caller, if it is called in
> > > kset_register().
> > > 
> > > So make the function documentation more explicit about calling
> > > kset_put() in the error path of caller.
> > > 
> > > Signed-off-by: Yang Yingliang 
> > > ---
> > >   lib/kobject.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/lib/kobject.c b/lib/kobject.c
> > > index a0b2dbfcfa23..6da04353d974 100644
> > > --- a/lib/kobject.c
> > > +++ b/lib/kobject.c
> > > @@ -834,6 +834,9 @@ EXPORT_SYMBOL_GPL(kobj_sysfs_ops);
> > >   /**
> > >* kset_register() - Initialize and add a kset.
> > >* @k: kset.
> > > + *
> > > + * If this function returns an error, kset_put() must be called to
> > > + * properly clean up the memory associated with the object.
> > >*/
> > And I'd continue the sentence, with " ... with the object,
> > for instance the memory for the kset.kobj.name when 
> > kobj_set_name(&kset.kobj, format, ...)
> > was called before calling kset_register()."
> kobject_cleanup() not only frees name, but aslo calls ->release() to free
> another resources.

Yes, but it's the kobject of the kset, which does need to have it's name
cleaned up, but that kobject should NOT be freeing any larger structures
that the kset might be embedded in, right?

> > This makes it clear what we want to make sure is freed, in case of an early 
> > error
> > from kset_register().
> 
> How about like this:
> 
> If this function returns an error, kset_put() must be called to clean up the 
> name of
> kset object and other memory associated with the object.

Again, I think we can fix this up to not be needed.

thanks,

greg k-h

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Greg KH

On Fri, Oct 21, 2022 at 03:55:18AM -0400, Luben Tuikov wrote:
> On 2022-10-21 01:37, Greg KH wrote:
> > On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
> >> On 2022-10-20 22:20, Yang Yingliang wrote:
> >>> The previous discussion link:
> >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7C65b33f087ef245a9f23708dab3264840%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019274318153227%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=1ZoieEob62iU9kI8fvpp20qGut9EeHKIHtCAT01t%2Bz8%3D&reserved=0
> >>
> >> The very first discussion on this was here:
> >>
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7C65b33f087ef245a9f23708dab3264840%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019274318153227%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=9joWxGLUxZZMvrfkxCR8KbkoXifsqoMK0vGR%2FyEG62w%3D&reserved=0
> >>
> >> Please use this link, and not the that one up there you which quoted above,
> >> and whose commit description is taken verbatim from the this link.
> >>
> >>>
> >>> kset_register() is currently used in some places without calling
> >>> kset_put() in error path, because the callers think it should be
> >>> kset internal thing to do, but the driver core can not know what
> >>> caller doing with that memory at times. The memory could be freed
> >>> both in kset_put() and error path of caller, if it is called in
> >>> kset_register().
> >>
> >> As I explained in the link above, the reason there's
> >> a memory leak is that one cannot call kset_register() without
> >> the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
> >> in this case, i.e. kset_register() fails with -EINVAL.
> >>
> >> Thus, the most common usage is something like this:
> >>
> >>kobj_set_name(&kset->kobj, format, ...);
> >>kset->kobj.kset = parent_kset;
> >>kset->kobj.ktype = ktype;
> >>res = kset_register(kset);
> >>
> >> So, what is being leaked, is the memory allocated in kobj_set_name(),
> >> by the common idiom shown above. This needs to be mentioned in
> >> the documentation, at least, in case, in the future this is absolved
> >> in kset_register() redesign, etc.
> > 
> > Based on this, can kset_register() just clean up from itself when an
> > error happens?  Ideally that would be the case, as the odds of a kset
> > being embedded in a larger structure is probably slim, but we would have
> > to search the tree to make sure.
> 
> Looking at kset_register(), we can add kset_put() in the error path,
> when kobject_add_internal(&kset->kobj) fails.
> 
> See the attached patch. It needs to be tested with the same error injection
> as Yang has been doing.
> 
> Now, struct kset is being embedded in larger structs--see amdgpu_discovery.c
> starting at line 575. If you're on an AMD system, it gets you the tree
> structure you'll see when you run "tree 
> /sys/class/drm/card0/device/ip_discovery/".
> That shouldn't be a problem though.

Yes, that shouldn't be an issue as the kobject embedded in a kset is
ONLY for that kset itself, the kset structure should not be controling
the lifespan of the object it is embedded in, right?

Note, the use of ksets by a device driver like you are doing here in the
amd driver is BROKEN and will cause problems by userspace tools.  Don't
do that please, just use a single subdirectory for an attribute.  Doing
deeper stuff like this is sure to cause problems and be a headache.

thanks,

greg k-h

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Yang Yingliang




On 2022/10/21 17:08, Luben Tuikov wrote:

On 2022-10-21 04:59, Yang Yingliang wrote:

On 2022/10/21 16:36, Greg KH wrote:

On Fri, Oct 21, 2022 at 04:24:23PM +0800, Yang Yingliang wrote:

On 2022/10/21 13:37, Greg KH wrote:

On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:

On 2022-10-20 22:20, Yang Yingliang wrote:

The previous discussion link:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7C74aa9b57192b406ef27408dab3429db4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019395979868103%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=RcK05cXm1J5%2BtYcLO2SMG7k6sjeymQzdBzMCDJSzfdE%3D&reserved=0

The very first discussion on this was here:

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7C74aa9b57192b406ef27408dab3429db4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019395979868103%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=sHZ6kfLF8HxrNXV6%2FVjgdH%2BmQM4T3Zv0U%2FAwddT97cE%3D&reserved=0

Please use this link, and not the that one up there you which quoted above,
and whose commit description is taken verbatim from the this link.


kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

As I explained in the link above, the reason there's
a memory leak is that one cannot call kset_register() without
the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
in this case, i.e. kset_register() fails with -EINVAL.

Thus, the most common usage is something like this:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

So, what is being leaked, is the memory allocated in kobj_set_name(),
by the common idiom shown above. This needs to be mentioned in
the documentation, at least, in case, in the future this is absolved
in kset_register() redesign, etc.

Based on this, can kset_register() just clean up from itself when an
error happens?  Ideally that would be the case, as the odds of a kset
being embedded in a larger structure is probably slim, but we would have
to search the tree to make sure.

I have search the whole tree, the kset used in bus_register() - patch #3,
kset_create_and_add() - patch #4
__class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and
amdgpu_discovery.c - patch #10
is embedded in a larger structure. In these cases, we can not call
kset_put() in error path in kset_register()

Yes you can as the kobject in the kset should NOT be controling the
lifespan of those larger objects.

If it is, please point out the call chain here as I don't think that
should be possible.

Note all of this is a mess because the kobject name stuff was added much
later, after the driver model had been created and running for a while.
We missed this error path when adding the dynamic kobject name logic,
thank for looking into this.

If you could test the patch posted with your error injection systems,
that could make this all much simpler to solve.

The patch posted by Luben will cause double free in some cases.

Yes, I figured this out in the other email and posted the scenario Greg
was asking about.

But I believe the question still stands if we can do kset_put()
after a *failed* kset_register(), namely if more is being done than
necessary, which is just to free the memory allocated by
kobject_set_name().
The name memory is allocated in kobject_set_name() in caller,  and I 
think caller
free the memory that it allocated is reasonable, it's weird that some 
callers allocate
some memory and use function (kset_register) failed, then it free the 
memory allocated
in callers,  I think use kset_put()/kfree_const(name) in caller seems 
more reasonable.


Thanks,
Yang


Regards,
Luben
.

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Yang Yingliang




On 2022/10/21 13:37, Greg KH wrote:

On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:

On 2022-10-20 22:20, Yang Yingliang wrote:

The previous discussion link:
https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/

The very first discussion on this was here:

https://www.spinics.net/lists/dri-devel/msg368077.html

Please use this link, and not the that one up there you which quoted above,
and whose commit description is taken verbatim from the this link.


kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

As I explained in the link above, the reason there's
a memory leak is that one cannot call kset_register() without
the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
in this case, i.e. kset_register() fails with -EINVAL.

Thus, the most common usage is something like this:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

So, what is being leaked, is the memory allocated in kobj_set_name(),
by the common idiom shown above. This needs to be mentioned in
the documentation, at least, in case, in the future this is absolved
in kset_register() redesign, etc.

Based on this, can kset_register() just clean up from itself when an
error happens?  Ideally that would be the case, as the odds of a kset
being embedded in a larger structure is probably slim, but we would have
to search the tree to make sure.
I have search the whole tree, the kset used in bus_register() - patch 
#3, kset_create_and_add() - patch #4
__class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and 
amdgpu_discovery.c - patch #10
is embedded in a larger structure. In these cases, we can not call 
kset_put() in error path in kset_register()

itself.

Thanks,
Yang


thanks,

greg k-h
.

RE: [PATCH 1/2] drm/amdkfd: introduce dummy cache info for property asic

2022-10-21 Thread Liang, Prike

[Public]

-Original Message-
From: Kuehling, Felix 
Sent: Friday, October 21, 2022 1:11 PM
To: Liang, Prike ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Zhang, Yifan 
; Huang, Ray ; Liu, Aaron 

Subject: Re: [PATCH 1/2] drm/amdkfd: introduce dummy cache info for property 
asic

Am 2022-10-20 um 21:50 schrieb Liang, Prike:
> [Public]
>
> -Original Message-
> From: Kuehling, Felix 
> Sent: Friday, October 21, 2022 12:03 AM
> To: Liang, Prike ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Zhang, Yifan
> ; Huang, Ray ; Liu, Aaron
> 
> Subject: Re: [PATCH 1/2] drm/amdkfd: introduce dummy cache info for
> property asic
>
>
> Am 2022-10-20 um 05:15 schrieb Prike Liang:
>> This dummy cache info will enable kfd base function support.
>>
>> Signed-off-by: Prike Liang 
>> ---
>>drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 55 +--
>>1 file changed, 52 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>> index cd5f8b219bf9..960046e43b7a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>> @@ -795,6 +795,54 @@ static struct kfd_gpu_cache_info 
>> yellow_carp_cache_info[] = {
>>},
>>};
>>
>> +static struct kfd_gpu_cache_info dummy_cache_info[] = {
>> + {
>> + /* TCP L1 Cache per CU */
>> + .cache_size = 16,
>> + .cache_level = 1,
>> + .flags = (CRAT_CACHE_FLAGS_ENABLED |
>> + CRAT_CACHE_FLAGS_DATA_CACHE |
>> + CRAT_CACHE_FLAGS_SIMD_CACHE),
>> + .num_cu_shared = 1,
>> + },
>> + {
>> + /* Scalar L1 Instruction Cache per SQC */
>> + .cache_size = 32,
>> + .cache_level = 1,
>> + .flags = (CRAT_CACHE_FLAGS_ENABLED |
>> + CRAT_CACHE_FLAGS_INST_CACHE |
>> + CRAT_CACHE_FLAGS_SIMD_CACHE),
>> + .num_cu_shared = 2,
>> + },
>> + {
>> + /* Scalar L1 Data Cache per SQC */
>> + .cache_size = 16,
>> + .cache_level = 1,
>> + .flags = (CRAT_CACHE_FLAGS_ENABLED |
>> + CRAT_CACHE_FLAGS_DATA_CACHE |
>> + CRAT_CACHE_FLAGS_SIMD_CACHE),
>> + .num_cu_shared = 2,
>> + },
>> + {
>> + /* GL1 Data Cache per SA */
>> + .cache_size = 128,
>> + .cache_level = 1,
>> + .flags = (CRAT_CACHE_FLAGS_ENABLED |
>> + CRAT_CACHE_FLAGS_DATA_CACHE |
>> + CRAT_CACHE_FLAGS_SIMD_CACHE),
>> + .num_cu_shared = 6,
>> + },
>> + {
>> + /* L2 Data Cache per GPU (Total Tex Cache) */
>> + .cache_size = 2048,
>> + .cache_level = 2,
>> + .flags = (CRAT_CACHE_FLAGS_ENABLED |
>> + CRAT_CACHE_FLAGS_DATA_CACHE |
>> + CRAT_CACHE_FLAGS_SIMD_CACHE),
>> + .num_cu_shared = 6,
>> + },
>> +};
>> +
>>static void kfd_populated_cu_info_cpu(struct kfd_topology_device *dev,
>>struct crat_subtype_computeunit *cu)
>>{
>> @@ -1514,8 +1562,6 @@ static int kfd_fill_gpu_cache_info(struct kfd_dev 
>> *kdev,
>>num_of_cache_types = 
>> ARRAY_SIZE(beige_goby_cache_info);
>>break;
>>case IP_VERSION(10, 3, 3):
>> - case IP_VERSION(10, 3, 6): /* TODO: Double check these on 
>> production silicon */
>> - case IP_VERSION(10, 3, 7): /* TODO: Double check these on 
>> production silicon */
>>pcache_info = yellow_carp_cache_info;
>>num_of_cache_types = 
>> ARRAY_SIZE(yellow_carp_cache_info);
>>break;
>> @@ -1528,7 +1574,10 @@ static int kfd_fill_gpu_cache_info(struct kfd_dev 
>> *kdev,
>>kfd_fill_gpu_cache_info_from_gfx_config(kdev, 
>> pcache_info);
>>break;
>>default:
>> - return -EINVAL;
>> + pcache_info = dummy_cache_info;
>> + num_of_cache_types = ARRAY_SIZE(dummy_cache_info);
>> + pr_warn("dummy cache info is used temporarily and real 
>> cache info need update later.\n");
>> + break;
> Could we make this return an error if the amdgpu.exp_hw_support module 
> parameter is not set?
>
> Regards,
> Felix
>
> [Prike] It's fine to protect this dummy info by checking the parameter 
> amdgpu_exp_hw_support, but it may not friendly to end user by adding the 
> parameter and some guys will still report KFD not enabled for this parameter 
> setting problem. The original idea is the end user will not aware the dummy 
> cache info and only alert the

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-21 Thread Mikhail Gavrilov

On Fri, Oct 21, 2022 at 1:33 PM Christian König
 wrote:
>
> Hi,
>
> yes Bas already reported this issue, but I couldn't reproduce it. Need
> to come up with a patch to narrow this down further.
>
> Can I send you something to test?

I would appreciate to test any patches and ideas.

-- 
Best Regards,
Mike Gavrilov.

Re: [Intel-gfx] [PATCH v7 0/9] dyndbg: drm.debug adaptation

2022-10-21 Thread Jani Nikula

On Thu, 20 Oct 2022, Ville Syrjälä  wrote:
> On Sat, Sep 24, 2022 at 03:02:34PM +0200, Greg KH wrote:
>> On Sun, Sep 11, 2022 at 11:28:43PM -0600, Jim Cromie wrote:
>> > hi Greg, Dan, Jason, DRM-folk,
>> > 
>> > heres follow-up to V6:
>> >   rebased on driver-core/driver-core-next for -v6 applied bits (thanks)
>> >   rework drm_debug_enabled{_raw,_instrumented,} per Dan.
>> > 
>> > It excludes:
>> >   nouveau parts (immature)
>> >   tracefs parts (I missed --to=Steve on v6)
>> >   split _ddebug_site and de-duplicate experiment (way unready)
>> > 
>> > IOW, its the remaining commits of V6 on which Dan gave his Reviewed-by.
>> > 
>> > If these are good to apply, I'll rebase and repost the rest separately.
>> 
>> All now queued up, thanks.
>
> This stuff broke i915 debugs. When I first load i915 no debug prints are
> produced. If I then go fiddle around in /sys/module/drm/parameters/debug
> the debug prints start to suddenly work.

Wait what? I always assumed the default behaviour would stay the same,
which is usually how we roll. It's a regression in my books. We've got a
CI farm that's not very helpful in terms of dmesg logging right now
because of this.

BR,
Jani.


-- 
Jani Nikula, Intel Open Source Graphics Center

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Luben Tuikov

On 2022-10-21 04:59, Yang Yingliang wrote:
> 
> On 2022/10/21 16:36, Greg KH wrote:
>> On Fri, Oct 21, 2022 at 04:24:23PM +0800, Yang Yingliang wrote:
>>> On 2022/10/21 13:37, Greg KH wrote:
 On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
> On 2022-10-20 22:20, Yang Yingliang wrote:
>> The previous discussion link:
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7C74aa9b57192b406ef27408dab3429db4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019395979868103%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=RcK05cXm1J5%2BtYcLO2SMG7k6sjeymQzdBzMCDJSzfdE%3D&reserved=0
> The very first discussion on this was here:
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7C74aa9b57192b406ef27408dab3429db4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019395979868103%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=sHZ6kfLF8HxrNXV6%2FVjgdH%2BmQM4T3Zv0U%2FAwddT97cE%3D&reserved=0
>
> Please use this link, and not the that one up there you which quoted 
> above,
> and whose commit description is taken verbatim from the this link.
>
>> kset_register() is currently used in some places without calling
>> kset_put() in error path, because the callers think it should be
>> kset internal thing to do, but the driver core can not know what
>> caller doing with that memory at times. The memory could be freed
>> both in kset_put() and error path of caller, if it is called in
>> kset_register().
> As I explained in the link above, the reason there's
> a memory leak is that one cannot call kset_register() without
> the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
> in this case, i.e. kset_register() fails with -EINVAL.
>
> Thus, the most common usage is something like this:
>
>   kobj_set_name(&kset->kobj, format, ...);
>   kset->kobj.kset = parent_kset;
>   kset->kobj.ktype = ktype;
>   res = kset_register(kset);
>
> So, what is being leaked, is the memory allocated in kobj_set_name(),
> by the common idiom shown above. This needs to be mentioned in
> the documentation, at least, in case, in the future this is absolved
> in kset_register() redesign, etc.
 Based on this, can kset_register() just clean up from itself when an
 error happens?  Ideally that would be the case, as the odds of a kset
 being embedded in a larger structure is probably slim, but we would have
 to search the tree to make sure.
>>> I have search the whole tree, the kset used in bus_register() - patch #3,
>>> kset_create_and_add() - patch #4
>>> __class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and
>>> amdgpu_discovery.c - patch #10
>>> is embedded in a larger structure. In these cases, we can not call
>>> kset_put() in error path in kset_register()
>> Yes you can as the kobject in the kset should NOT be controling the
>> lifespan of those larger objects.
>>
>> If it is, please point out the call chain here as I don't think that
>> should be possible.
>>
>> Note all of this is a mess because the kobject name stuff was added much
>> later, after the driver model had been created and running for a while.
>> We missed this error path when adding the dynamic kobject name logic,
>> thank for looking into this.
>>
>> If you could test the patch posted with your error injection systems,
>> that could make this all much simpler to solve.
> 
> The patch posted by Luben will cause double free in some cases.

Yes, I figured this out in the other email and posted the scenario Greg
was asking about.

But I believe the question still stands if we can do kset_put()
after a *failed* kset_register(), namely if more is being done than
necessary, which is just to free the memory allocated by
kobject_set_name().

Regards,
Luben

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Luben Tuikov

On 2022-10-21 04:36, Greg KH wrote:
> On Fri, Oct 21, 2022 at 04:24:23PM +0800, Yang Yingliang wrote:
>>
>> On 2022/10/21 13:37, Greg KH wrote:
>>> On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
 On 2022-10-20 22:20, Yang Yingliang wrote:
> The previous discussion link:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7Ca8206f9348e04b13e3da08dab33f4f53%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019381738406942%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=HqvF1p4ejF6%2BYS5u0pe15ZdDgUAIVP%2BB1xQXICWjNwY%3D&reserved=0
 The very first discussion on this was here:

 https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7Ca8206f9348e04b13e3da08dab33f4f53%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019381738406942%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=LmRgWUSQgK6wJvMdfBgjO4CiaKQ2TBoeW836r0UbcjU%3D&reserved=0

 Please use this link, and not the that one up there you which quoted above,
 and whose commit description is taken verbatim from the this link.

> kset_register() is currently used in some places without calling
> kset_put() in error path, because the callers think it should be
> kset internal thing to do, but the driver core can not know what
> caller doing with that memory at times. The memory could be freed
> both in kset_put() and error path of caller, if it is called in
> kset_register().
 As I explained in the link above, the reason there's
 a memory leak is that one cannot call kset_register() without
 the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
 in this case, i.e. kset_register() fails with -EINVAL.

 Thus, the most common usage is something like this:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

 So, what is being leaked, is the memory allocated in kobj_set_name(),
 by the common idiom shown above. This needs to be mentioned in
 the documentation, at least, in case, in the future this is absolved
 in kset_register() redesign, etc.
>>> Based on this, can kset_register() just clean up from itself when an
>>> error happens?  Ideally that would be the case, as the odds of a kset
>>> being embedded in a larger structure is probably slim, but we would have
>>> to search the tree to make sure.
>> I have search the whole tree, the kset used in bus_register() - patch #3,
>> kset_create_and_add() - patch #4
>> __class_register() - patch #5,  fw_cfg_build_symlink() - patch #6 and
>> amdgpu_discovery.c - patch #10
>> is embedded in a larger structure. In these cases, we can not call
>> kset_put() in error path in kset_register()
> 
> Yes you can as the kobject in the kset should NOT be controling the
> lifespan of those larger objects.
> 
> If it is, please point out the call chain here as I don't think that
> should be possible.

WLOG, I believe it is something like this:

x = kzalloc();

kobject_set_name(&x->kset.kobj, format, ...);
x->kset.kobj.kset = parent_kset;
x->kset.kobj.ktype = this_ktype;  /* this has a .release which frees x 
*/
res = kset_register(&x->kset);
if (res) {
kset_put(&x->kset);   /* calls this_ktype->release() which 
frees x */ 
kfree(x); /* <-- double free */
}

And since kref is set to 1, in kset_register(), then we'd double free.
This is why I don't have kset_put() in that error path in amdgpu.

Regards,
Luben

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Luben Tuikov

On 2022-10-21 04:24, Luben Tuikov wrote:
> On 2022-10-21 04:18, Greg KH wrote:
>> On Fri, Oct 21, 2022 at 03:55:18AM -0400, Luben Tuikov wrote:
>>> On 2022-10-21 01:37, Greg KH wrote:
 On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
> On 2022-10-20 22:20, Yang Yingliang wrote:
>> The previous discussion link:
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7Cd41da3fd6449492d01f808dab33cdb75%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019371236833115%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=C%2Bj1THkHpzVGks5eqB%2Fm%2FPAkMRohR7CYvRnOCqUqdcM%3D&reserved=0
>
> The very first discussion on this was here:
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7Cd41da3fd6449492d01f808dab33cdb75%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019371236833115%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=pSR10abmK8nAMvKSezqWC0SPUBL4qEwtCCizyIKW7Dc%3D&reserved=0
>
> Please use this link, and not the that one up there you which quoted 
> above,
> and whose commit description is taken verbatim from the this link.
>
>>
>> kset_register() is currently used in some places without calling
>> kset_put() in error path, because the callers think it should be
>> kset internal thing to do, but the driver core can not know what
>> caller doing with that memory at times. The memory could be freed
>> both in kset_put() and error path of caller, if it is called in
>> kset_register().
>
> As I explained in the link above, the reason there's
> a memory leak is that one cannot call kset_register() without
> the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
> in this case, i.e. kset_register() fails with -EINVAL.
>
> Thus, the most common usage is something like this:
>
>   kobj_set_name(&kset->kobj, format, ...);
>   kset->kobj.kset = parent_kset;
>   kset->kobj.ktype = ktype;
>   res = kset_register(kset);
>
> So, what is being leaked, is the memory allocated in kobj_set_name(),
> by the common idiom shown above. This needs to be mentioned in
> the documentation, at least, in case, in the future this is absolved
> in kset_register() redesign, etc.

 Based on this, can kset_register() just clean up from itself when an
 error happens?  Ideally that would be the case, as the odds of a kset
 being embedded in a larger structure is probably slim, but we would have
 to search the tree to make sure.
>>>
>>> Looking at kset_register(), we can add kset_put() in the error path,
>>> when kobject_add_internal(&kset->kobj) fails.
>>>
>>> See the attached patch. It needs to be tested with the same error injection
>>> as Yang has been doing.
>>>
>>> Now, struct kset is being embedded in larger structs--see amdgpu_discovery.c
>>> starting at line 575. If you're on an AMD system, it gets you the tree
>>> structure you'll see when you run "tree 
>>> /sys/class/drm/card0/device/ip_discovery/".
>>> That shouldn't be a problem though.
>>
>> Yes, that shouldn't be an issue as the kobject embedded in a kset is
>> ONLY for that kset itself, the kset structure should not be controling
>> the lifespan of the object it is embedded in, right?
> 
> Yes, and it doesn't. It only does a kobject_get(parent) and 
> kobject_put(parent).
> So that's fine and natural.
> 
> Yang, do you want to try the patch in my previous email in this thread, since 
> you've
> got the error injection set up already?

I spoke too soon. I believe you're onto something, because looking at the idiom:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

The ktype defines a release method, which frees the larger struct the kset is 
embedded in.
And this would result in double free, for instance in the amdgpu_discovery.c 
code, if
kset_put() is called after kset_register() fails, since we kzalloc the larger 
object
just before and kfree it on error just after. Ideally, we'd only "revert" the 
actions
done by kobj_set_name(), as there's some error recovery on create_dir() in 
kobject_add_internal().

So, we cannot do this business with the kset_put() on error from 
kset_register(), after all.
Not sure how this wasn't caught in Yang's testing--the kernel should've 
complained.

Regards,
Luben

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-21 Thread Christian König


Hi,

yes Bas already reported this issue, but I couldn't reproduce it. Need 
to come up with a patch to narrow this down further.


Can I send you something to test?

Thanks for the help,
Christian.

Am 21.10.22 um 10:08 schrieb Mikhail Gavrilov:

Hi!
I found that some games (Cyberpunk 2077, Forza Horizon 4/5) hang at
start after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6.

dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König 
Date:   Thu Jul 14 10:23:38 2022 +0200

 drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

 This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

 We found that the bo_list is missing a protection for its list entries.
 Since that is fixed now this workaround can be removed again.

 Signed-off-by: Christian König 
 Reviewed-by: Alex Deucher 
 Signed-off-by: Alex Deucher 

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c |  2 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  1 -
  3 files changed, 6 insertions(+), 18 deletions(-)


And when it happening in kernel log appears a such backtrace:
[  231.331210] [ cut here ]
[  231.331262] WARNING: CPU: 11 PID: 6555 at
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:675
amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  231.331424] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir
snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek snd_sof
snd_hda_codec_generic snd_hda_codec_hdmi snd_sof_utils mt7921e
snd_hda_intel sunrpc snd_intel_dspcfg mt7921_common binfmt_misc
snd_intel_sdw_acpi snd_hda_codec mt76_connac_lib edac_mce_amd btusb
snd_soc_core mt76 snd_hda_core btrtl snd_hwdep snd_compress kvm_amd
ac97_bus snd_seq btbcm snd_pcm_dmaengine btintel snd_rpl_pci_acp6x
mac80211 btmtk snd_pci_acp6x kvm snd_seq_device snd_pcm snd_pci_acp5x
libarc4 irqbypass bluetooth snd_rn_pci_acp3x snd_timer pcspkr
asus_nb_wmi rapl joydev wmi_bmof snd_acp_config cfg80211 snd_soc_acpi
vfat snd
[  231.331490]  snd_pci_acp3x i2c_piix4 soundcore fat k10temp amd_pmc
asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi
iommu_v2 crct10dif_pclmul crc32_pclmul gpu_sched crc32c_intel
ledtrig_audio sparse_keymap polyval_clmulni platform_profile drm_buddy
polyval_generic hid_multitouch drm_display_helper rfkill nvme
ucsi_acpi ghash_clmulni_intel nvme_core video typec_ucsi serio_raw ccp
sha512_ssse3 sp5100_tco r8169 cec nvme_common typec wmi i2c_hid_acpi
i2c_hid ip6_tables ip_tables fuse
[  231.331532] CPU: 11 PID: 6555 Comm: GameThread Tainted: GW
   L---  ---
6.1.0-0.rc1.20221019gitaae703b02f92.17.fc38.x86_64 #1
[  231.331534] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[  231.331537] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  231.331654] Code: a8 d0 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 40
82 f3 c0 48 c7 c7 10 60 14 c1 e8 2f a0 f4 d0 eb 8e 66 90 bd f2 ff ff
ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff
ff 48
[  231.331656] RSP: 0018:aad4c705bae8 EFLAGS: 00010286
[  231.331659] RAX: 8e9cbdbe3200 RBX: 8e997e3f2440 RCX: 
[  231.331661] RDX:  RSI: 8e9cbdbe3200 RDI: 8e9c31208000
[  231.331663] RBP: 0001 R08: 0dc0 R09: 
[  231.331665] R10: 0001 R11:  R12: aad4c705bb90
[  231.331666] R13: 7651 R14: 8e9c89f334e0 R15: 8e991fda8000
[  231.331668] FS:  7c2af6c0() GS:8ea7d8e0()
knlGS:7b2c
[  231.331671] CS:  0010 DS:  ES:  CR0: 80050033
[  231.331673] CR2: 7ff65ffd8000 CR3: 0004f90f CR4: 00750ee0
[  231.331674] PKRU: 5554
[  231.331676] Call Trace:
[  231.331678]  
[  231.331682]  amdgpu_cs_ioctl+0x87e/0x1fc0 [amdgpu]
[  231.331824]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  231.331981]  drm_ioctl_kernel+0xac/0x160
[  231.331990]  drm_ioctl+0x1e7/0x450
[  231.331994]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  231.332118]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[  231.332233]  __x64_sys_ioctl+0x90/0xd0
[  231.332238]  do_syscall_64+0x5b/0x80
[  231.332243]  ? asm_exc_page_fault+0x22/0x30
[  231.332247]  ? lockdep_hardirqs_on+0x7d/0x100
[  231.332250]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  231.332253] RIP: 0033:0x7ff677c5704f
[  231.332256] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Luben Tuikov

On 2022-10-21 04:18, Greg KH wrote:
> On Fri, Oct 21, 2022 at 03:55:18AM -0400, Luben Tuikov wrote:
>> On 2022-10-21 01:37, Greg KH wrote:
>>> On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
 On 2022-10-20 22:20, Yang Yingliang wrote:
> The previous discussion link:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7Cd41da3fd6449492d01f808dab33cdb75%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019371236833115%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=C%2Bj1THkHpzVGks5eqB%2Fm%2FPAkMRohR7CYvRnOCqUqdcM%3D&reserved=0

 The very first discussion on this was here:

 https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7Cd41da3fd6449492d01f808dab33cdb75%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019371236833115%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=pSR10abmK8nAMvKSezqWC0SPUBL4qEwtCCizyIKW7Dc%3D&reserved=0

 Please use this link, and not the that one up there you which quoted above,
 and whose commit description is taken verbatim from the this link.

>
> kset_register() is currently used in some places without calling
> kset_put() in error path, because the callers think it should be
> kset internal thing to do, but the driver core can not know what
> caller doing with that memory at times. The memory could be freed
> both in kset_put() and error path of caller, if it is called in
> kset_register().

 As I explained in the link above, the reason there's
 a memory leak is that one cannot call kset_register() without
 the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
 in this case, i.e. kset_register() fails with -EINVAL.

 Thus, the most common usage is something like this:

kobj_set_name(&kset->kobj, format, ...);
kset->kobj.kset = parent_kset;
kset->kobj.ktype = ktype;
res = kset_register(kset);

 So, what is being leaked, is the memory allocated in kobj_set_name(),
 by the common idiom shown above. This needs to be mentioned in
 the documentation, at least, in case, in the future this is absolved
 in kset_register() redesign, etc.
>>>
>>> Based on this, can kset_register() just clean up from itself when an
>>> error happens?  Ideally that would be the case, as the odds of a kset
>>> being embedded in a larger structure is probably slim, but we would have
>>> to search the tree to make sure.
>>
>> Looking at kset_register(), we can add kset_put() in the error path,
>> when kobject_add_internal(&kset->kobj) fails.
>>
>> See the attached patch. It needs to be tested with the same error injection
>> as Yang has been doing.
>>
>> Now, struct kset is being embedded in larger structs--see amdgpu_discovery.c
>> starting at line 575. If you're on an AMD system, it gets you the tree
>> structure you'll see when you run "tree 
>> /sys/class/drm/card0/device/ip_discovery/".
>> That shouldn't be a problem though.
> 
> Yes, that shouldn't be an issue as the kobject embedded in a kset is
> ONLY for that kset itself, the kset structure should not be controling
> the lifespan of the object it is embedded in, right?

Yes, and it doesn't. It only does a kobject_get(parent) and kobject_put(parent).
So that's fine and natural.

Yang, do you want to try the patch in my previous email in this thread, since 
you've
got the error injection set up already?

Regards,
Luben

RE: [PATCH 4/4] drm/amdgpu: remove ras_error_status parameter for UMC poison handler

2022-10-21 Thread Zhang, Hawking

[AMD Official Use Only - General]

Series is

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Zhou1, Tao  
Sent: Friday, October 21, 2022 15:36
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Yang, Stanley ; Chai, Thomas ; Li, 
Candice 
Cc: Zhou1, Tao 
Subject: [PATCH 4/4] drm/amdgpu: remove ras_error_status parameter for UMC 
poison handler

Make the code more simple.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  4 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c|  3 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c| 13 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h|  4 +---
 4 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 0561812aa0a4..37db39ba8718 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -753,9 +753,7 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
amdgpu_device *adev)
 
 void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev, 
bool reset)  {
-   struct ras_err_data err_data = {0, 0, 0, NULL};
-
-   amdgpu_umc_poison_handler(adev, &err_data, reset);
+   amdgpu_umc_poison_handler(adev, reset);
 }
 
 bool amdgpu_amdkfd_ras_query_utcl2_poison_status(struct amdgpu_device *adev) 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 28463b47ce33..693bce07eb46 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1561,7 +1561,6 @@ static void 
amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *  {
bool poison_stat = false;
struct amdgpu_device *adev = obj->adev;
-   struct ras_err_data err_data = {0, 0, 0, NULL};
struct amdgpu_ras_block_object *block_obj =
amdgpu_ras_get_ras_block(adev, obj->head.block, 0);
 
@@ -1584,7 +1583,7 @@ static void 
amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
}
 
if (!adev->gmc.xgmi.connected_to_cpu)
-   amdgpu_umc_poison_handler(adev, &err_data, false);
+   amdgpu_umc_poison_handler(adev, false);
 
if (block_obj->hw_ops->handle_poison_consumption)
poison_stat = 
block_obj->hw_ops->handle_poison_consumption(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 758942150c09..f76c19fc0392 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -165,25 +165,22 @@ static int amdgpu_umc_do_page_retirement(struct 
amdgpu_device *adev,
return AMDGPU_RAS_SUCCESS;
 }
 
-int amdgpu_umc_poison_handler(struct amdgpu_device *adev,
-   void *ras_error_status,
-   bool reset)
+int amdgpu_umc_poison_handler(struct amdgpu_device *adev, bool reset)
 {
int ret = AMDGPU_RAS_SUCCESS;
 
if (!adev->gmc.xgmi.connected_to_cpu) {
-   struct ras_err_data *err_data = (struct ras_err_data 
*)ras_error_status;
+   struct ras_err_data err_data = {0, 0, 0, NULL};
struct ras_common_if head = {
.block = AMDGPU_RAS_BLOCK__UMC,
};
struct ras_manager *obj = amdgpu_ras_find_obj(adev, &head);
 
-   ret =
-   amdgpu_umc_do_page_retirement(adev, ras_error_status, 
NULL, reset);
+   ret = amdgpu_umc_do_page_retirement(adev, &err_data, NULL, 
reset);
 
if (ret == AMDGPU_RAS_SUCCESS && obj) {
-   obj->err_data.ue_count += err_data->ue_count;
-   obj->err_data.ce_count += err_data->ce_count;
+   obj->err_data.ue_count += err_data.ue_count;
+   obj->err_data.ce_count += err_data.ce_count;
}
} else if (reset) {
/* MCA poison handler is only responsible for GPU reset, diff 
--git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 659a10de29c9..a6951160f13a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -83,9 +83,7 @@ struct amdgpu_umc {
 };
 
 int amdgpu_umc_ras_late_init(struct amdgpu_device *adev, struct ras_common_if 
*ras_block); -int amdgpu_umc_poison_handler(struct amdgpu_device *adev,
-   void *ras_error_status,
-   bool reset);
+int amdgpu_umc_poison_handler(struct amdgpu_device *adev, bool reset);
 int amdgpu_umc_process_ecc_irq(struct amdgpu_device *adev,
struct amdgpu_irq_src *source,
struct amdgpu_iv_entry *entry);
--
2.35.1

Re: [PATCH 01/11] kset: fix documentation for kset_register()

2022-10-21 Thread Luben Tuikov

On 2022-10-21 04:05, Yang Yingliang wrote:
> 
> On 2022/10/21 13:34, Luben Tuikov wrote:
>> On 2022-10-20 22:20, Yang Yingliang wrote:
>>> kset_register() is currently used in some places without calling
>>> kset_put() in error path, because the callers think it should be
>>> kset internal thing to do, but the driver core can not know what
>>> caller doing with that memory at times. The memory could be freed
>>> both in kset_put() and error path of caller, if it is called in
>>> kset_register().
>>>
>>> So make the function documentation more explicit about calling
>>> kset_put() in the error path of caller.
>>>
>>> Signed-off-by: Yang Yingliang 
>>> ---
>>>   lib/kobject.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/lib/kobject.c b/lib/kobject.c
>>> index a0b2dbfcfa23..6da04353d974 100644
>>> --- a/lib/kobject.c
>>> +++ b/lib/kobject.c
>>> @@ -834,6 +834,9 @@ EXPORT_SYMBOL_GPL(kobj_sysfs_ops);
>>>   /**
>>>* kset_register() - Initialize and add a kset.
>>>* @k: kset.
>>> + *
>>> + * If this function returns an error, kset_put() must be called to
>>> + * properly clean up the memory associated with the object.
>>>*/
>> And I'd continue the sentence, with " ... with the object,
>> for instance the memory for the kset.kobj.name when 
>> kobj_set_name(&kset.kobj, format, ...)
>> was called before calling kset_register()."
> kobject_cleanup() not only frees name, but aslo calls ->release() to 
> free another resources.

Yes, it does. For this reason I said "for instance..." I didn't want to include
this in case in the future if the code changes, the comment would be wrong. IOW,
I wanted to add the minimalist comment possible.

>>
>> This makes it clear what we want to make sure is freed, in case of an early 
>> error
>> from kset_register().
> 
> How about like this:
> 
> If this function returns an error, kset_put() must be called to clean up the 
> name of
> kset object and other memory associated with the object.

It's bit too wordy and redundant with what else it does--this can be gleaned
from the code. I'd say:

On error, kset_put() should be called to clean up at least 
kset.kobj.name allocated
by kobj_set_name(&kset.kobj, format, ...).

This tells the reader the symmetry of the calls: kobj_set_name() --> 
kset_register() --> kset_put();
Because if the code evolves to use other means of allocation, or if the the 
user allocates a name
by different means, then they'll understand what to watch out for.

Regards,
Luben

> 
>>
>> Regards,
>> Luben
>>
>>>   int kset_register(struct kset *k)
>>>   {
>> .

[6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-21 Thread Mikhail Gavrilov

Hi!
I found that some games (Cyberpunk 2077, Forza Horizon 4/5) hang at
start after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6.

dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König 
Date:   Thu Jul 14 10:23:38 2022 +0200

drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

We found that the bo_list is missing a protection for its list entries.
Since that is fixed now this workaround can be removed again.

Signed-off-by: Christian König 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  1 -
 3 files changed, 6 insertions(+), 18 deletions(-)


And when it happening in kernel log appears a such backtrace:
[  231.331210] [ cut here ]
[  231.331262] WARNING: CPU: 11 PID: 6555 at
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:675
amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  231.331424] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir
snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek snd_sof
snd_hda_codec_generic snd_hda_codec_hdmi snd_sof_utils mt7921e
snd_hda_intel sunrpc snd_intel_dspcfg mt7921_common binfmt_misc
snd_intel_sdw_acpi snd_hda_codec mt76_connac_lib edac_mce_amd btusb
snd_soc_core mt76 snd_hda_core btrtl snd_hwdep snd_compress kvm_amd
ac97_bus snd_seq btbcm snd_pcm_dmaengine btintel snd_rpl_pci_acp6x
mac80211 btmtk snd_pci_acp6x kvm snd_seq_device snd_pcm snd_pci_acp5x
libarc4 irqbypass bluetooth snd_rn_pci_acp3x snd_timer pcspkr
asus_nb_wmi rapl joydev wmi_bmof snd_acp_config cfg80211 snd_soc_acpi
vfat snd
[  231.331490]  snd_pci_acp3x i2c_piix4 soundcore fat k10temp amd_pmc
asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi
iommu_v2 crct10dif_pclmul crc32_pclmul gpu_sched crc32c_intel
ledtrig_audio sparse_keymap polyval_clmulni platform_profile drm_buddy
polyval_generic hid_multitouch drm_display_helper rfkill nvme
ucsi_acpi ghash_clmulni_intel nvme_core video typec_ucsi serio_raw ccp
sha512_ssse3 sp5100_tco r8169 cec nvme_common typec wmi i2c_hid_acpi
i2c_hid ip6_tables ip_tables fuse
[  231.331532] CPU: 11 PID: 6555 Comm: GameThread Tainted: GW
  L---  ---
6.1.0-0.rc1.20221019gitaae703b02f92.17.fc38.x86_64 #1
[  231.331534] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[  231.331537] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  231.331654] Code: a8 d0 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 40
82 f3 c0 48 c7 c7 10 60 14 c1 e8 2f a0 f4 d0 eb 8e 66 90 bd f2 ff ff
ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff
ff 48
[  231.331656] RSP: 0018:aad4c705bae8 EFLAGS: 00010286
[  231.331659] RAX: 8e9cbdbe3200 RBX: 8e997e3f2440 RCX: 
[  231.331661] RDX:  RSI: 8e9cbdbe3200 RDI: 8e9c31208000
[  231.331663] RBP: 0001 R08: 0dc0 R09: 
[  231.331665] R10: 0001 R11:  R12: aad4c705bb90
[  231.331666] R13: 7651 R14: 8e9c89f334e0 R15: 8e991fda8000
[  231.331668] FS:  7c2af6c0() GS:8ea7d8e0()
knlGS:7b2c
[  231.331671] CS:  0010 DS:  ES:  CR0: 80050033
[  231.331673] CR2: 7ff65ffd8000 CR3: 0004f90f CR4: 00750ee0
[  231.331674] PKRU: 5554
[  231.331676] Call Trace:
[  231.331678]  
[  231.331682]  amdgpu_cs_ioctl+0x87e/0x1fc0 [amdgpu]
[  231.331824]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  231.331981]  drm_ioctl_kernel+0xac/0x160
[  231.331990]  drm_ioctl+0x1e7/0x450
[  231.331994]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  231.332118]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[  231.332233]  __x64_sys_ioctl+0x90/0xd0
[  231.332238]  do_syscall_64+0x5b/0x80
[  231.332243]  ? asm_exc_page_fault+0x22/0x30
[  231.332247]  ? lockdep_hardirqs_on+0x7d/0x100
[  231.332250]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  231.332253] RIP: 0033:0x7ff677c5704f
[  231.332256] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[  231.332258] RSP: 002b:7c2ad470 EFLAGS: 0246 ORIG_RAX:
0010
[  231.332261] RAX: ffda RBX: 7c2ad718 RCX: 7ff677c5704f
[  231.332263] RDX: 7c2ad540 RSI: c0186444

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Luben Tuikov

On 2022-10-21 01:37, Greg KH wrote:
> On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
>> On 2022-10-20 22:20, Yang Yingliang wrote:
>>> The previous discussion link:
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F0db486eb-6927-927e-3629-958f8f211194%40huawei.com%2FT%2F&data=05%7C01%7Cluben.tuikov%40amd.com%7C65b33f087ef245a9f23708dab3264840%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019274318153227%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=1ZoieEob62iU9kI8fvpp20qGut9EeHKIHtCAT01t%2Bz8%3D&reserved=0
>>
>> The very first discussion on this was here:
>>
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg368077.html&data=05%7C01%7Cluben.tuikov%40amd.com%7C65b33f087ef245a9f23708dab3264840%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638019274318153227%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=9joWxGLUxZZMvrfkxCR8KbkoXifsqoMK0vGR%2FyEG62w%3D&reserved=0
>>
>> Please use this link, and not the that one up there you which quoted above,
>> and whose commit description is taken verbatim from the this link.
>>
>>>
>>> kset_register() is currently used in some places without calling
>>> kset_put() in error path, because the callers think it should be
>>> kset internal thing to do, but the driver core can not know what
>>> caller doing with that memory at times. The memory could be freed
>>> both in kset_put() and error path of caller, if it is called in
>>> kset_register().
>>
>> As I explained in the link above, the reason there's
>> a memory leak is that one cannot call kset_register() without
>> the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
>> in this case, i.e. kset_register() fails with -EINVAL.
>>
>> Thus, the most common usage is something like this:
>>
>>  kobj_set_name(&kset->kobj, format, ...);
>>  kset->kobj.kset = parent_kset;
>>  kset->kobj.ktype = ktype;
>>  res = kset_register(kset);
>>
>> So, what is being leaked, is the memory allocated in kobj_set_name(),
>> by the common idiom shown above. This needs to be mentioned in
>> the documentation, at least, in case, in the future this is absolved
>> in kset_register() redesign, etc.
> 
> Based on this, can kset_register() just clean up from itself when an
> error happens?  Ideally that would be the case, as the odds of a kset
> being embedded in a larger structure is probably slim, but we would have
> to search the tree to make sure.

Looking at kset_register(), we can add kset_put() in the error path,
when kobject_add_internal(&kset->kobj) fails.

See the attached patch. It needs to be tested with the same error injection
as Yang has been doing.

Now, struct kset is being embedded in larger structs--see amdgpu_discovery.c
starting at line 575. If you're on an AMD system, it gets you the tree
structure you'll see when you run "tree 
/sys/class/drm/card0/device/ip_discovery/".
That shouldn't be a problem though.

Regards,
LubenFrom 71e0a22801c0699f67ea40ed96e0a7d7d9e0f318 Mon Sep 17 00:00:00 2001
From: Luben Tuikov 
Date: Fri, 21 Oct 2022 03:34:21 -0400
Subject: [PATCH] kobject: Add kset_put() if kset_register() fails
X-check-string-leak: v1.0

If kset_register() fails, we call kset_put() before returning the
error. This makes sure that we free memory allocated by kobj_set_name()
for the kset, since kset_register() cannot be called unless the kset has
a name, usually gotten via kobj_set_name(&kset->kobj, format, ...);

Cc: Greg Kroah-Hartman 
Cc: Rafael J. Wysocki 
Cc: Yang Yingliang 
Cc: Linux Kernel Mailing List 
Signed-off-by: Luben Tuikov 
---
 lib/kobject.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/kobject.c b/lib/kobject.c
index a0b2dbfcfa2334..c122b979f2b75e 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -844,8 +844,10 @@ int kset_register(struct kset *k)
 
 	kset_init(k);
 	err = kobject_add_internal(&k->kobj);
-	if (err)
+	if (err) {
+		kset_put(k);
 		return err;
+	}
 	kobject_uevent(&k->kobj, KOBJ_ADD);
 	return 0;
 }
-- 
2.38.0-rc2

Re: [PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)

2022-10-21 Thread Michel Dänzer

On 2022-10-20 16:59, Christian König wrote:
> Am 20.10.22 um 16:49 schrieb Michel Dänzer:
>> On 2022-10-18 11:08, jiadong@amd.com wrote:
>>> From: "Jiadong.Zhu" 
>>>
>>> The software ring is created to support priority context while there is only
>>> one hardware queue for gfx.
>>>
>>> Every software ring has its fence driver and could be used as an ordinary 
>>> ring
>>> for the GPU scheduler.
>>> Multiple software rings are bound to a real ring with the ring muxer. The
>>> packages committed on the software ring are copied to the real ring.
>>>
>>> v2: Use array to store software ring entry.
>>> v3: Remove unnecessary prints.
>>> v4: Remove amdgpu_ring_sw_init/fini functions,
>>> using gtt for sw ring buffer for later dma copy
>>> optimization.
>>> v5: Allocate ring entry dynamically in the muxer.
>>> v6: Update comments for the ring muxer.
>>> v7: Modify for function naming.
>>> v8: Combine software ring functions into amdgpu_ring_mux.c
>> I tested patches 1-4 of this series (since Christian clearly nacked patch 
>> 5). I hit the oops below.
> 
> As long as you don't try to reset the GPU you can also test patch 5.

Sure, I can test it once there's a fix for the oops.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer

[PATCH 4/4] drm/amdgpu: remove ras_error_status parameter for UMC poison handler

2022-10-21 Thread Tao Zhou

Make the code more simple.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  4 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c|  3 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c| 13 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h|  4 +---
 4 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 0561812aa0a4..37db39ba8718 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -753,9 +753,7 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
amdgpu_device *adev)
 
 void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev, 
bool reset)
 {
-   struct ras_err_data err_data = {0, 0, 0, NULL};
-
-   amdgpu_umc_poison_handler(adev, &err_data, reset);
+   amdgpu_umc_poison_handler(adev, reset);
 }
 
 bool amdgpu_amdkfd_ras_query_utcl2_poison_status(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 28463b47ce33..693bce07eb46 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1561,7 +1561,6 @@ static void 
amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
 {
bool poison_stat = false;
struct amdgpu_device *adev = obj->adev;
-   struct ras_err_data err_data = {0, 0, 0, NULL};
struct amdgpu_ras_block_object *block_obj =
amdgpu_ras_get_ras_block(adev, obj->head.block, 0);
 
@@ -1584,7 +1583,7 @@ static void 
amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
}
 
if (!adev->gmc.xgmi.connected_to_cpu)
-   amdgpu_umc_poison_handler(adev, &err_data, false);
+   amdgpu_umc_poison_handler(adev, false);
 
if (block_obj->hw_ops->handle_poison_consumption)
poison_stat = 
block_obj->hw_ops->handle_poison_consumption(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 758942150c09..f76c19fc0392 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -165,25 +165,22 @@ static int amdgpu_umc_do_page_retirement(struct 
amdgpu_device *adev,
return AMDGPU_RAS_SUCCESS;
 }
 
-int amdgpu_umc_poison_handler(struct amdgpu_device *adev,
-   void *ras_error_status,
-   bool reset)
+int amdgpu_umc_poison_handler(struct amdgpu_device *adev, bool reset)
 {
int ret = AMDGPU_RAS_SUCCESS;
 
if (!adev->gmc.xgmi.connected_to_cpu) {
-   struct ras_err_data *err_data = (struct ras_err_data 
*)ras_error_status;
+   struct ras_err_data err_data = {0, 0, 0, NULL};
struct ras_common_if head = {
.block = AMDGPU_RAS_BLOCK__UMC,
};
struct ras_manager *obj = amdgpu_ras_find_obj(adev, &head);
 
-   ret =
-   amdgpu_umc_do_page_retirement(adev, ras_error_status, 
NULL, reset);
+   ret = amdgpu_umc_do_page_retirement(adev, &err_data, NULL, 
reset);
 
if (ret == AMDGPU_RAS_SUCCESS && obj) {
-   obj->err_data.ue_count += err_data->ue_count;
-   obj->err_data.ce_count += err_data->ce_count;
+   obj->err_data.ue_count += err_data.ue_count;
+   obj->err_data.ce_count += err_data.ce_count;
}
} else if (reset) {
/* MCA poison handler is only responsible for GPU reset,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 659a10de29c9..a6951160f13a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -83,9 +83,7 @@ struct amdgpu_umc {
 };
 
 int amdgpu_umc_ras_late_init(struct amdgpu_device *adev, struct ras_common_if 
*ras_block);
-int amdgpu_umc_poison_handler(struct amdgpu_device *adev,
-   void *ras_error_status,
-   bool reset);
+int amdgpu_umc_poison_handler(struct amdgpu_device *adev, bool reset);
 int amdgpu_umc_process_ecc_irq(struct amdgpu_device *adev,
struct amdgpu_irq_src *source,
struct amdgpu_iv_entry *entry);
-- 
2.35.1

[PATCH 3/4] drm/amdgpu: add RAS poison handling for MCA

2022-10-21 Thread Tao Zhou

For MCA poison, if unmap queue fails, only gpu reset should be
triggered without page retirement handling, MCA notifier will do it.

v2: handle MCA poison consumption in umc_poison_handler directly.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 31 -
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 3c83129f4090..758942150c09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -169,19 +169,28 @@ int amdgpu_umc_poison_handler(struct amdgpu_device *adev,
void *ras_error_status,
bool reset)
 {
-   int ret;
-   struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status;
-   struct ras_common_if head = {
-   .block = AMDGPU_RAS_BLOCK__UMC,
-   };
-   struct ras_manager *obj = amdgpu_ras_find_obj(adev, &head);
+   int ret = AMDGPU_RAS_SUCCESS;
 
-   ret =
-   amdgpu_umc_do_page_retirement(adev, ras_error_status, NULL, 
reset);
+   if (!adev->gmc.xgmi.connected_to_cpu) {
+   struct ras_err_data *err_data = (struct ras_err_data 
*)ras_error_status;
+   struct ras_common_if head = {
+   .block = AMDGPU_RAS_BLOCK__UMC,
+   };
+   struct ras_manager *obj = amdgpu_ras_find_obj(adev, &head);
 
-   if (ret == AMDGPU_RAS_SUCCESS && obj) {
-   obj->err_data.ue_count += err_data->ue_count;
-   obj->err_data.ce_count += err_data->ce_count;
+   ret =
+   amdgpu_umc_do_page_retirement(adev, ras_error_status, 
NULL, reset);
+
+   if (ret == AMDGPU_RAS_SUCCESS && obj) {
+   obj->err_data.ue_count += err_data->ue_count;
+   obj->err_data.ce_count += err_data->ce_count;
+   }
+   } else if (reset) {
+   /* MCA poison handler is only responsible for GPU reset,
+* let MCA notifier do page retirement.
+*/
+   kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+   amdgpu_ras_reset_gpu(adev);
}
 
return ret;
-- 
2.35.1

[PATCH 2/4] drm/amdgpu: use page retirement API in MCA notifier

2022-10-21 Thread Tao Zhou

Make the code more readable.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 36 +++--
 1 file changed, 3 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 21a47f2bb87b..28463b47ce33 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -36,7 +36,6 @@
 #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
 #include "atom.h"
 #include "amdgpu_reset.h"
-#include "umc_v6_7.h"
 
 #ifdef CONFIG_X86_MCE_AMD
 #include 
@@ -2849,7 +2848,6 @@ static int amdgpu_bad_page_notifier(struct notifier_block 
*nb,
struct amdgpu_device *adev = NULL;
uint32_t gpu_id = 0;
uint32_t umc_inst = 0, ch_inst = 0;
-   struct ras_err_data err_data = {0, 0, 0, NULL};
 
/*
 * If the error was generated in UMC_V2, which belongs to GPU UMCs,
@@ -2888,38 +2886,10 @@ static int amdgpu_bad_page_notifier(struct 
notifier_block *nb,
dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d, 
chan_idx: %d",
 umc_inst, ch_inst);
 
-   err_data.err_addr =
-   kcalloc(adev->umc.max_ras_err_cnt_per_query,
-   sizeof(struct eeprom_table_record), GFP_KERNEL);
-   if (!err_data.err_addr) {
-   dev_warn(adev->dev,
-   "Failed to alloc memory for umc error record in mca 
notifier!\n");
-   return NOTIFY_DONE;
-   }
-
-   /*
-* Translate UMC channel address to Physical address
-*/
-   switch (adev->ip_versions[UMC_HWIP][0]) {
-   case IP_VERSION(6, 7, 0):
-   umc_v6_7_convert_error_address(adev,
-   &err_data, m->addr, ch_inst, umc_inst);
-   break;
-   default:
-   dev_warn(adev->dev,
-"UMC address to Physical address translation is not 
supported\n");
-   kfree(err_data.err_addr);
+   if (!amdgpu_umc_page_retirement_mca(adev, m->addr, ch_inst, umc_inst))
+   return NOTIFY_OK;
+   else
return NOTIFY_DONE;
-   }
-
-   if (amdgpu_bad_page_threshold != 0) {
-   amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
-   err_data.err_addr_cnt);
-   amdgpu_ras_save_bad_pages(adev);
-   }
-
-   kfree(err_data.err_addr);
-   return NOTIFY_OK;
 }
 
 static struct notifier_block amdgpu_bad_page_nb = {
-- 
2.35.1

[PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-21 Thread Tao Zhou

Define page retirement functions for MCA platform.

v2: remove page retirement handling from MCA poison handler,
let MCA notifier do page retirement.

v3: remove specific poison handler for MCA to simplify code.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 53 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |  2 +
 2 files changed, 55 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index aad3c8b4c810..3c83129f4090 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -22,6 +22,59 @@
  */
 
 #include "amdgpu.h"
+#include "umc_v6_7.h"
+
+static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, uint64_t 
err_addr,
+   uint32_t ch_inst, uint32_t umc_inst)
+{
+   switch (adev->ip_versions[UMC_HWIP][0]) {
+   case IP_VERSION(6, 7, 0):
+   umc_v6_7_convert_error_address(adev,
+   err_data, err_addr, ch_inst, umc_inst);
+   break;
+   default:
+   dev_warn(adev->dev,
+"UMC address to Physical address translation is not 
supported\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
+
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
+{
+   struct ras_err_data err_data = {0, 0, 0, NULL};
+   int ret = AMDGPU_RAS_FAIL;
+
+   err_data.err_addr =
+   kcalloc(adev->umc.max_ras_err_cnt_per_query,
+   sizeof(struct eeprom_table_record), GFP_KERNEL);
+   if (!err_data.err_addr) {
+   dev_warn(adev->dev,
+   "Failed to alloc memory for umc error record in MCA 
notifier!\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   /*
+* Translate UMC channel address to Physical address
+*/
+   ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
+   ch_inst, umc_inst);
+   if (ret)
+   goto out;
+
+   if (amdgpu_bad_page_threshold != 0) {
+   amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
+   err_data.err_addr_cnt);
+   amdgpu_ras_save_bad_pages(adev);
+   }
+
+out:
+   kfree(err_data.err_addr);
+   return ret;
+}
 
 static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
void *ras_error_status,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 3629d8f292ef..659a10de29c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct ras_err_data 
*err_data,
 int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
void *ras_error_status,
struct amdgpu_iv_entry *entry);
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst);
 #endif
-- 
2.35.1

Re: [PATCH] drm/amdgpu: fix sdma doorbell init ordering on APUs

2022-10-21 Thread Shuah Khan


On 10/19/22 21:48, Alex Deucher wrote:

Commit 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in 
get_port_device_capability()")
uncovered a bug in amdgpu that required a reordering of the driver
init sequence to avoid accessing a special register on the GPU
before it was properly set up leading to an PCI AER error.  This
reordering uncovered a different hw programming ordering dependency
in some APUs where the SDMA doorbells need to be programmed before
the GFX doorbells. To fix this, move the SDMA doorbell programming
back into the soc15 common code, but use the actual doorbell range
values directly rather than the values stored in the ring structure
since those will not be initialized at this point.

This is a partial revert, but with the doorbell assignment
fixed so the proper doorbell index is set before it's used.

Fixes: e3163bc8ffdfdb ("drm/amdgpu: move nbio sdma_doorbell_range() into sdma code 
for vega")
Signed-off-by: Alex Deucher 
Cc: sk...@linuxfoundation.org


Thank you for fixing this quickly and getting me back to 6.1-rc1
on my primary system.

Reported-and-Tested-by: Shuah Khan 

thanks,
-- Shuah

[PATCH 05/11] class: fix possible memory leak in __class_register()

2022-10-21 Thread Yang Yingliang

Inject fault while loading module (e.g. pktcdvd.ko), kset_register() may
fail in __class_register(), if it fails, but the refcount of kobject is
not decreased to 0, the name allocated in kobject_set_name() is leaked.
To fix this by calling kfree_const().

unreferenced object 0x888102fa8190 (size 8):
  comm "modprobe", pid 502, jiffies 4294906074 (age 49.296s)
  hex dump (first 8 bytes):
70 6b 74 63 64 76 64 00  pktcdvd.
  backtrace:
[] __kmalloc_track_caller+0x1ae/0x320
[<5e4d70bc>] kstrdup+0x3a/0x70
[] kstrdup_const+0x68/0x80
[<0049a8c7>] kvasprintf_const+0x10b/0x190
[<29123163>] kobject_set_name_vargs+0x56/0x150
[<747219c9>] kobject_set_name+0xab/0xe0
[<05f1ea4e>] __class_register+0x15c/0x49a

If class_add_groups() fails, it need delete kobject and free its name,
besides, subsys_private also need be freed.

unreferenced object 0x888037274000 (size 1024):
  comm "modprobe", pid 502, jiffies 4294906074 (age 49.296s)
  hex dump (first 32 bytes):
00 40 27 37 80 88 ff ff 00 40 27 37 80 88 ff ff  .@'7.@'7
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
  backtrace:
[<151f9600>] kmem_cache_alloc_trace+0x17c/0x2f0
[] __class_register+0x86/0x49a

It can not call kset_put() or kset_unregister() in error path, because
the 'cls' will be freed in callback function class_release() and it also
freed in error path, it will cause double free.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Yang Yingliang 
---
 drivers/base/class.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/base/class.c b/drivers/base/class.c
index 64f7b9a0970f..87de0a04ee9b 100644
--- a/drivers/base/class.c
+++ b/drivers/base/class.c
@@ -187,11 +187,17 @@ int __class_register(struct class *cls, struct 
lock_class_key *key)
 
error = kset_register(&cp->subsys);
if (error) {
+   kfree_const(cp->subsys.kobj.name);
kfree(cp);
return error;
}
error = class_add_groups(class_get(cls), cls->class_groups);
class_put(cls);
+   if (error) {
+   kobject_del(&cp->subsys.kobj);
+   kfree_const(cp->subsys.kobj.name);
+   kfree(cp);
+   }
return error;
 }
 EXPORT_SYMBOL_GPL(__class_register);
-- 
2.25.1

[PATCH RESEND] drm/amd/display: move remaining FPU code to dml folder

2022-10-21 Thread Ao Zhong

Move remaining FPU code to dml folder
in preparation for enabling aarch64 support.

Signed-off-by: Ao Zhong 
---
 .../drm/amd/display/dc/dcn10/dcn10_resource.c | 44 +--
 .../drm/amd/display/dc/dcn32/dcn32_resource.c |  5 ++-
 .../drm/amd/display/dc/dml/dcn10/dcn10_fpu.c  | 40 +
 .../drm/amd/display/dc/dml/dcn10/dcn10_fpu.h  |  3 ++
 .../drm/amd/display/dc/dml/dcn32/dcn32_fpu.c  |  8 
 .../drm/amd/display/dc/dml/dcn32/dcn32_fpu.h  |  3 ++
 6 files changed, 59 insertions(+), 44 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
index 56d30baf12df..6bfac8088ab0 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
@@ -1295,47 +1295,6 @@ static uint32_t read_pipe_fuses(struct dc_context *ctx)
return value;
 }
 
-/*
- * Some architectures don't support soft-float (e.g. aarch64), on those
- * this function has to be called with hardfloat enabled, make sure not
- * to inline it so whatever fp stuff is done stays inside
- */
-static noinline void dcn10_resource_construct_fp(
-   struct dc *dc)
-{
-   if (dc->ctx->dce_version == DCN_VERSION_1_01) {
-   struct dcn_soc_bounding_box *dcn_soc = dc->dcn_soc;
-   struct dcn_ip_params *dcn_ip = dc->dcn_ip;
-   struct display_mode_lib *dml = &dc->dml;
-
-   dml->ip.max_num_dpp = 3;
-   /* TODO how to handle 23.84? */
-   dcn_soc->dram_clock_change_latency = 23;
-   dcn_ip->max_num_dpp = 3;
-   }
-   if (ASICREV_IS_RV1_F0(dc->ctx->asic_id.hw_internal_rev)) {
-   dc->dcn_soc->urgent_latency = 3;
-   dc->debug.disable_dmcu = true;
-   dc->dcn_soc->fabric_and_dram_bandwidth_vmax0p9 = 41.60f;
-   }
-
-
-   dc->dcn_soc->number_of_channels = dc->ctx->asic_id.vram_width / 
ddr4_dram_width;
-   ASSERT(dc->dcn_soc->number_of_channels < 3);
-   if (dc->dcn_soc->number_of_channels == 0)/*old sbios bug*/
-   dc->dcn_soc->number_of_channels = 2;
-
-   if (dc->dcn_soc->number_of_channels == 1) {
-   dc->dcn_soc->fabric_and_dram_bandwidth_vmax0p9 = 19.2f;
-   dc->dcn_soc->fabric_and_dram_bandwidth_vnom0p8 = 17.066f;
-   dc->dcn_soc->fabric_and_dram_bandwidth_vmid0p72 = 14.933f;
-   dc->dcn_soc->fabric_and_dram_bandwidth_vmin0p65 = 12.8f;
-   if (ASICREV_IS_RV1_F0(dc->ctx->asic_id.hw_internal_rev)) {
-   dc->dcn_soc->fabric_and_dram_bandwidth_vmax0p9 = 20.80f;
-   }
-   }
-}
-
 static bool verify_clock_values(struct dm_pp_clock_levels_with_voltage *clks)
 {
int i;
@@ -1510,8 +1469,9 @@ static bool dcn10_resource_construct(
memcpy(dc->dcn_ip, &dcn10_ip_defaults, sizeof(dcn10_ip_defaults));
memcpy(dc->dcn_soc, &dcn10_soc_defaults, sizeof(dcn10_soc_defaults));
 
-   /* Other architectures we build for build this with soft-float */
+   DC_FP_START();
dcn10_resource_construct_fp(dc);
+   DC_FP_END();
 
if (!dc->config.is_vmin_only_asic)
if (ASICREV_IS_RAVEN2(dc->ctx->asic_id.hw_internal_rev))
diff --git a/drivers/gpu/drm/amd/display/dc/dcn32/dcn32_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn32/dcn32_resource.c
index a88dd7b3d1c1..287b7fa9bf41 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn32/dcn32_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn32/dcn32_resource.c
@@ -1918,8 +1918,9 @@ int dcn32_populate_dml_pipes_from_context(
timing = &pipe->stream->timing;
 
pipes[pipe_cnt].pipe.src.gpuvm = true;
-   pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_luma = 0;
-   pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_chroma = 0;
+   DC_FP_START();
+   dcn32_zero_pipe_dcc_fraction(pipes, pipe_cnt);
+   DC_FP_END();
pipes[pipe_cnt].pipe.dest.vfront_porch = timing->v_front_porch;
pipes[pipe_cnt].pipe.src.gpuvm_min_page_size_kbytes = 256; // 
according to spreadsheet
pipes[pipe_cnt].pipe.src.unbounded_req_mode = false;
diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn10/dcn10_fpu.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn10/dcn10_fpu.c
index 99644d896222..0495cecaf1df 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn10/dcn10_fpu.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn10/dcn10_fpu.c
@@ -27,6 +27,8 @@
 #include "dcn10/dcn10_resource.h"
 
 #include "dcn10_fpu.h"
+#include "resource.h"
+#include "amdgpu_dm/dc_fpu.h"
 
 /**
  * DOC: DCN10 FPU manipulation Overview
@@ -121,3 +123,41 @@ struct _vcs_dpi_soc_bounding_box_st dcn1_0_soc = {
.writeback_dram_clock_change_latency_us = 23.0,
.return_bus_width_bytes = 64,
 };
+
+void dcn10_resource_construct_fp(
+   struct dc *dc)
+{
+   dc_assert_fp_enabled();
+

[PATCH 09/11] ocfs2: possible memory leak in mlog_sys_init()

2022-10-21 Thread Yang Yingliang

Inject fault while loading module, kset_register() may fail,
if it fails, but the refcount of kobject is not decreased to
0, the name allocated in kobject_set_name() is leaked. Fix
this by calling kset_put(), so that name can be freed in
callback function kobject_cleanup().

unreferenced object 0x888100da9348 (size 8):
  comm "modprobe", pid 257, jiffies 4294701096 (age 33.334s)
  hex dump (first 8 bytes):
6c 6f 67 6d 61 73 6b 00  logmask.
  backtrace:
[<306e441c>] __kmalloc_node_track_caller+0x44/0x1b0
[<7c491a9e>] kstrdup+0x3a/0x70
[<15719a3b>] kstrdup_const+0x63/0x80
[<84e458ea>] kvasprintf_const+0x149/0x180
[<91302b42>] kobject_set_name_vargs+0x56/0x150
[<5f48eeac>] kobject_set_name+0xab/0xe0

Fixes: 34980ca8faeb ("Drivers: clean up direct setting of the name of a kset")
Signed-off-by: Yang Yingliang 
---
 fs/ocfs2/cluster/masklog.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ocfs2/cluster/masklog.c b/fs/ocfs2/cluster/masklog.c
index 563881ddbf00..7f9ba816d955 100644
--- a/fs/ocfs2/cluster/masklog.c
+++ b/fs/ocfs2/cluster/masklog.c
@@ -156,6 +156,7 @@ static struct kset mlog_kset = {
 int mlog_sys_init(struct kset *o2cb_kset)
 {
int i = 0;
+   int ret;
 
while (mlog_attrs[i].attr.mode) {
mlog_default_attrs[i] = &mlog_attrs[i].attr;
@@ -165,7 +166,11 @@ int mlog_sys_init(struct kset *o2cb_kset)
 
kobject_set_name(&mlog_kset.kobj, "logmask");
mlog_kset.kobj.kset = o2cb_kset;
-   return kset_register(&mlog_kset);
+   ret = kset_register(&mlog_kset);
+   if (ret)
+   kset_put(&mlog_kset);
+
+   return ret;
 }
 
 void mlog_sys_shutdown(void)
-- 
2.25.1

[PATCH] drm/amd/display: move remaining FPU code to dml folder

2022-10-21 Thread Ao Zhong


Subject: [PATCH] drm/amd/display: move remaining FPU code to dml folder

Move remaining FPU code to dml folder
in preparation for enabling aarch64 support.

Signed-off-by: Ao Zhong 
---
 .../drm/amd/display/dc/dcn10/dcn10_resource.c | 44 +--
 .../drm/amd/display/dc/dcn32/dcn32_resource.c |  5 ++-
 .../drm/amd/display/dc/dml/dcn10/dcn10_fpu.c  | 40 +
 .../drm/amd/display/dc/dml/dcn10/dcn10_fpu.h  |  3 ++
 .../drm/amd/display/dc/dml/dcn32/dcn32_fpu.c  |  8 
 .../drm/amd/display/dc/dml/dcn32/dcn32_fpu.h  |  3 ++
 6 files changed, 59 insertions(+), 44 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c

index 56d30baf12df..6bfac8088ab0 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
@@ -1295,47 +1295,6 @@ static uint32_t read_pipe_fuses(struct dc_context 
*ctx)

 return value;
 }

-/*
- * Some architectures don't support soft-float (e.g. aarch64), on those
- * this function has to be called with hardfloat enabled, make sure not
- * to inline it so whatever fp stuff is done stays inside
- */
-static noinline void dcn10_resource_construct_fp(
-    struct dc *dc)
-{
-    if (dc->ctx->dce_version == DCN_VERSION_1_01) {
-        struct dcn_soc_bounding_box *dcn_soc = dc->dcn_soc;
-        struct dcn_ip_params *dcn_ip = dc->dcn_ip;
-        struct display_mode_lib *dml = &dc->dml;
-
-        dml->ip.max_num_dpp = 3;
-        /* TODO how to handle 23.84? */
-        dcn_soc->dram_clock_change_latency = 23;
-        dcn_ip->max_num_dpp = 3;
-    }
-    if (ASICREV_IS_RV1_F0(dc->ctx->asic_id.hw_internal_rev)) {
-        dc->dcn_soc->urgent_latency = 3;
-        dc->debug.disable_dmcu = true;
-        dc->dcn_soc->fabric_and_dram_bandwidth_vmax0p9 = 41.60f;
-    }
-
-
-    dc->dcn_soc->number_of_channels = dc->ctx->asic_id.vram_width / 
ddr4_dram_width;

-    ASSERT(dc->dcn_soc->number_of_channels < 3);
-    if (dc->dcn_soc->number_of_channels == 0)/*old sbios bug*/
-        dc->dcn_soc->number_of_channels = 2;
-
-    if (dc->dcn_soc->number_of_channels == 1) {
-        dc->dcn_soc->fabric_and_dram_bandwidth_vmax0p9 = 19.2f;
-        dc->dcn_soc->fabric_and_dram_bandwidth_vnom0p8 = 17.066f;
-        dc->dcn_soc->fabric_and_dram_bandwidth_vmid0p72 = 14.933f;
-        dc->dcn_soc->fabric_and_dram_bandwidth_vmin0p65 = 12.8f;
-        if (ASICREV_IS_RV1_F0(dc->ctx->asic_id.hw_internal_rev)) {
-            dc->dcn_soc->fabric_and_dram_bandwidth_vmax0p9 = 20.80f;
-        }
-    }
-}
-
 static bool verify_clock_values(struct dm_pp_clock_levels_with_voltage 
*clks)

 {
 int i;
@@ -1510,8 +1469,9 @@ static bool dcn10_resource_construct(
 memcpy(dc->dcn_ip, &dcn10_ip_defaults, sizeof(dcn10_ip_defaults));
 memcpy(dc->dcn_soc, &dcn10_soc_defaults, sizeof(dcn10_soc_defaults));

-    /* Other architectures we build for build this with soft-float */
+    DC_FP_START();
 dcn10_resource_construct_fp(dc);
+    DC_FP_END();

 if (!dc->config.is_vmin_only_asic)
     if (ASICREV_IS_RAVEN2(dc->ctx->asic_id.hw_internal_rev))
diff --git a/drivers/gpu/drm/amd/display/dc/dcn32/dcn32_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn32/dcn32_resource.c

index a88dd7b3d1c1..287b7fa9bf41 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn32/dcn32_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn32/dcn32_resource.c
@@ -1918,8 +1918,9 @@ int dcn32_populate_dml_pipes_from_context(
     timing = &pipe->stream->timing;

     pipes[pipe_cnt].pipe.src.gpuvm = true;
-        pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_luma = 0;
-        pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_chroma = 0;
+        DC_FP_START();
+        dcn32_zero_pipe_dcc_fraction(pipes, pipe_cnt);
+        DC_FP_END();
     pipes[pipe_cnt].pipe.dest.vfront_porch = timing->v_front_porch;
     pipes[pipe_cnt].pipe.src.gpuvm_min_page_size_kbytes = 256; // 
according to spreadsheet

     pipes[pipe_cnt].pipe.src.unbounded_req_mode = false;
diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn10/dcn10_fpu.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn10/dcn10_fpu.c

index 99644d896222..0495cecaf1df 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn10/dcn10_fpu.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn10/dcn10_fpu.c
@@ -27,6 +27,8 @@
 #include "dcn10/dcn10_resource.h"

 #include "dcn10_fpu.h"
+#include "resource.h"
+#include "amdgpu_dm/dc_fpu.h"

 /**
  * DOC: DCN10 FPU manipulation Overview
@@ -121,3 +123,41 @@ struct _vcs_dpi_soc_bounding_box_st dcn1_0_soc = {
 .writeback_dram_clock_change_latency_us = 23.0,
 .return_bus_width_bytes = 64,
 };
+
+void dcn10_resource_construct_fp(
+    struct dc *dc)
+{
+    dc_assert_fp_enabled();
+
+    if (dc->ctx->dce_version == DCN_VERSION_1_01) {
+        struct dcn_soc_bounding_box *dcn_soc = dc->dcn_soc;
+        struct dcn_ip_params *dcn_ip = dc->dcn_ip;
+        struct display_mode_lib *

[PATCH 11/11] ubifs: Fix memory leak in ubifs_sysfs_init()

2022-10-21 Thread Yang Yingliang

From: Liu Shixin 

When insmod ubifs.ko, a kmemleak reported as below:

 unreferenced object 0x88817fb1a780 (size 8):
   comm "insmod", pid 25265, jiffies 4295239702 (age 100.130s)
   hex dump (first 8 bytes):
 75 62 69 66 73 00 ff ff  ubifs...
   backtrace:
 [] slab_post_alloc_hook+0x9c/0x3c0
 [] __kmalloc_track_caller+0x183/0x410
 [] kstrdup+0x3a/0x80
 [] kstrdup_const+0x66/0x80
 [] kvasprintf_const+0x155/0x190
 [] kobject_set_name_vargs+0x5b/0x150
 [] kobject_set_name+0xbb/0xf0
 [] do_one_initcall+0x14c/0x5a0
 [] do_init_module+0x1f0/0x660
 [] load_module+0x6d7e/0x7590
 [] __do_sys_finit_module+0x19f/0x230
 [] __x64_sys_finit_module+0x73/0xb0
 [] do_syscall_64+0x35/0x80
 [] entry_SYSCALL_64_after_hwframe+0x63/0xcd

When kset_register() failed, we should call kset_put to cleanup it.

Fixes: 2e3cbf425804 ("ubifs: Export filesystem error counters")
Signed-off-by: Liu Shixin 
Signed-off-by: Yang Yingliang 
---
 fs/ubifs/sysfs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ubifs/sysfs.c b/fs/ubifs/sysfs.c
index 06ad8fa1fcfb..54270ad36321 100644
--- a/fs/ubifs/sysfs.c
+++ b/fs/ubifs/sysfs.c
@@ -144,6 +144,8 @@ int __init ubifs_sysfs_init(void)
kobject_set_name(&ubifs_kset.kobj, "ubifs");
ubifs_kset.kobj.parent = fs_kobj;
ret = kset_register(&ubifs_kset);
+   if (ret)
+   kset_put(&ubifs_kset);
 
return ret;
 }
-- 
2.25.1

[PATCH 01/11] kset: fix documentation for kset_register()

2022-10-21 Thread Yang Yingliang

kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

So make the function documentation more explicit about calling
kset_put() in the error path of caller.

Signed-off-by: Yang Yingliang 
---
 lib/kobject.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/kobject.c b/lib/kobject.c
index a0b2dbfcfa23..6da04353d974 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -834,6 +834,9 @@ EXPORT_SYMBOL_GPL(kobj_sysfs_ops);
 /**
  * kset_register() - Initialize and add a kset.
  * @k: kset.
+ *
+ * If this function returns an error, kset_put() must be called to
+ * properly clean up the memory associated with the object.
  */
 int kset_register(struct kset *k)
 {
-- 
2.25.1

[PATCH 06/11] firmware: qemu_fw_cfg: fix possible memory leak in fw_cfg_build_symlink()

2022-10-21 Thread Yang Yingliang

Inject fault while loading module, kset_register() may fail, if
it fails, but the refcount of kobject is not decreased to 0, the
name allocated in kobject_set_name() is leaked. To fix this by
calling kset_put(), so that name can be freed in callback function
kobject_cleanup() and 'subdir' is freed in kset_release().

unreferenced object 0x88810ad69050 (size 8):
  comm "swapper/0", pid 1, jiffies 4294677178 (age 38.812s)
  hex dump (first 8 bytes):
65 74 63 00 81 88 ff ff  etc.
  backtrace:
[] __kmalloc_node_track_caller+0x44/0x1b0
[<3f0167c7>] kstrdup+0x3a/0x70
[<49336709>] kstrdup_const+0x41/0x60
[<175616e4>] kvasprintf_const+0xf5/0x180
[<4bcc30f7>] kobject_set_name_vargs+0x56/0x150
[<4b0251bd>] kobject_set_name+0xab/0xe0
[<700151fb>] fw_cfg_sysfs_probe+0xa5b/0x1320

Fixes: 246c46ebaeae ("firmware: create directory hierarchy for sysfs fw_cfg 
entries")
Signed-off-by: Yang Yingliang 
---
 drivers/firmware/qemu_fw_cfg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c
index a69399a6b7c0..d036e69cabbb 100644
--- a/drivers/firmware/qemu_fw_cfg.c
+++ b/drivers/firmware/qemu_fw_cfg.c
@@ -544,7 +544,7 @@ static int fw_cfg_build_symlink(struct kset *dir,
}
ret = kset_register(subdir);
if (ret) {
-   kfree(subdir);
+   kset_put(subdir);
break;
}
 
-- 
2.25.1

[PATCH] drm/amd/display: add DCN support for ARM64

2022-10-21 Thread Ao Zhong

After moving all FPU code to the DML folder, we can enable DCN support
for the ARM64 platform. Remove the -mgeneral-regs-only CFLAG form the
code in the DML folder that needs to use hardware FPU, and add a control
mechanism for ARM Neon.

Signed-off-by: Ao Zhong 
---
 drivers/gpu/drm/amd/display/Kconfig   |  2 +-
 .../gpu/drm/amd/display/amdgpu_dm/dc_fpu.c|  6 ++
 drivers/gpu/drm/amd/display/dc/dml/Makefile   | 64 ---
 3 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/Kconfig 
b/drivers/gpu/drm/amd/display/Kconfig
index 0142affcdaa3..a7f1c4e51719 100644
--- a/drivers/gpu/drm/amd/display/Kconfig
+++ b/drivers/gpu/drm/amd/display/Kconfig
@@ -6,7 +6,7 @@ config DRM_AMD_DC
bool "AMD DC - Enable new display engine"
default y
select SND_HDA_COMPONENT if SND_HDA_CORE
-   select DRM_AMD_DC_DCN if (X86 || PPC64)
+   select DRM_AMD_DC_DCN if (X86 || PPC64 || (ARM64 && KERNEL_MODE_NEON))
help
  Choose this option if you want to use the new display engine
  support for AMDGPU. This adds required support for Vega and
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
index ab0c6d191038..1743ca0a3641 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
@@ -31,6 +31,8 @@
 #elif defined(CONFIG_PPC64)
 #include 
 #include 
+#elif defined(CONFIG_ARM64)
+#include 
 #endif
 
 /**
@@ -99,6 +101,8 @@ void dc_fpu_begin(const char *function_name, const int line)
preempt_disable();
enable_kernel_fp();
}
+#elif defined(CONFIG_ARM64)
+   kernel_neon_begin();
 #endif
}
 
@@ -136,6 +140,8 @@ void dc_fpu_end(const char *function_name, const int line)
disable_kernel_fp();
preempt_enable();
}
+#elif defined(CONFIG_ARM64)
+   kernel_neon_end();
 #endif
}
 
diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile 
b/drivers/gpu/drm/amd/display/dc/dml/Makefile
index d0c6cf61c676..3cdd109189e0 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile
@@ -33,6 +33,12 @@ ifdef CONFIG_PPC64
 dml_ccflags := -mhard-float -maltivec
 endif
 
+ifdef CONFIG_ARM64
+ifdef CONFIG_DRM_AMD_DC_DCN
+dml_rcflags_arm64 := -mgeneral-regs-only
+endif
+endif
+
 ifdef CONFIG_CC_IS_GCC
 ifeq ($(call cc-ifversion, -lt, 0701, y), y)
 IS_OLD_GCC = 1
@@ -87,32 +93,46 @@ CFLAGS_$(AMDDALPATH)/dc/dml/dsc/rc_calc_fpu.o := 
$(dml_ccflags)
 CFLAGS_$(AMDDALPATH)/dc/dml/calcs/dcn_calcs.o := $(dml_ccflags)
 CFLAGS_$(AMDDALPATH)/dc/dml/calcs/dcn_calc_auto.o := $(dml_ccflags)
 CFLAGS_$(AMDDALPATH)/dc/dml/calcs/dcn_calc_math.o := $(dml_ccflags) 
-Wno-tautological-compare
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/display_mode_vba.o := $(dml_rcflags)
+CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/display_mode_vba.o := $(dml_rcflags) 
$(dml_rcflags_arm64)
 CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn2x/dcn2x.o := $(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn20/display_rq_dlg_calc_20.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20v2.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn20/display_rq_dlg_calc_20v2.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn21/display_mode_vba_21.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn21/display_rq_dlg_calc_21.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn30/display_mode_vba_30.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn30/display_rq_dlg_calc_30.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn31/display_mode_vba_31.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn31/display_rq_dlg_calc_31.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn32/display_mode_vba_32.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn32/display_rq_dlg_calc_32.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn32/display_mode_vba_util_32.o := 
$(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn301/dcn301_fpu.o := $(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/display_mode_lib.o := $(dml_rcflags)
-CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dsc/rc_calc_fpu.o  := $(dml_rcflags)
+CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20.o := 
$(dml_rcflags) $(dml_rcflags_arm64)
+CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn20/display_rq_dlg_calc_20.o := 
$(dml_rcflags) $(dml_rcflags_arm64)
+CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20v2.o := 
$(dml_rcflags) $(dml_rcflags_arm64)
+CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn20/display_rq_dlg_calc_20v2.o := 
$(dml_rcflags) $(dml_rcflags_arm64)
+CFLAGS_REMOVE_$(AMDDALPATH)/dc/dml/dcn21/display_mode_vba_21.o := 
$(dml_rcflags) $(dml_rcflags_arm64)
+CF

Re: [PATCH] drm/amd/display: Revert logic for plane modifiers

2022-10-21 Thread Joaquin Aramendia

Hello. Thanks for the reply, Rodrigo
Will redo the patch and resubmit.

Cheers

[PATCH 08/11] erofs: fix possible memory leak in erofs_init_sysfs()

2022-10-21 Thread Yang Yingliang

Inject fault while loading module, kset_register() may fail,
if it fails, but the refcount of kobject is not decreased to
0, the name allocated in kobject_set_name() is leaked. Fix
this by calling kset_put(), so that name can be freed in
callback function kobject_cleanup().

unreferenced object 0x888101d228c0 (size 8):
  comm "modprobe", pid 276, jiffies 4294722700 (age 13.151s)
  hex dump (first 8 bytes):
65 72 6f 66 73 00 ff ff  erofs...
  backtrace:
[] __kmalloc_node_track_caller+0x44/0x1b0
[] kstrdup+0x3a/0x70
[<4a0e01d2>] kstrdup_const+0x63/0x80
[<051b6cda>] kvasprintf_const+0x149/0x180
[<4dc51dad>] kobject_set_name_vargs+0x56/0x150
[] kobject_set_name+0xab/0xe0

Fixes: 168e9a76200c ("erofs: add sysfs interface")
Signed-off-by: Yang Yingliang 
---
 fs/erofs/sysfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/erofs/sysfs.c b/fs/erofs/sysfs.c
index 783bb7b21b51..653b35001bc5 100644
--- a/fs/erofs/sysfs.c
+++ b/fs/erofs/sysfs.c
@@ -254,8 +254,10 @@ int __init erofs_init_sysfs(void)
kobject_set_name(&erofs_root.kobj, "erofs");
erofs_root.kobj.parent = fs_kobj;
ret = kset_register(&erofs_root);
-   if (ret)
+   if (ret) {
+   kset_put(&erofs_root);
goto root_err;
+   }
 
ret = kobject_init_and_add(&erofs_feat, &erofs_feat_ktype,
   NULL, "features");
-- 
2.25.1

[PATCH 10/11] drm/amdgpu/discovery: fix possible memory leak

2022-10-21 Thread Yang Yingliang

If kset_register() fails, the refcount of kobject is not 0,
the name allocated in kobject_set_name(&kset.kobj, ...) is
leaked. Fix this by calling kset_put(), so that it will be
freed in callback function kobject_cleanup().

Cc: sta...@vger.kernel.org
Fixes: a6c40b178092 ("drm/amdgpu: Show IP discovery in sysfs")
Signed-off-by: Yang Yingliang 
Reviewed-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 3993e6134914..638edcf70227 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -863,7 +863,7 @@ static int amdgpu_discovery_sysfs_ips(struct amdgpu_device 
*adev,
res = kset_register(&ip_hw_id->hw_id_kset);
if (res) {
DRM_ERROR("Couldn't register ip_hw_id 
kset");
-   kfree(ip_hw_id);
+   kset_put(&ip_hw_id->hw_id_kset);
return res;
}
if (hw_id_names[ii]) {
@@ -954,7 +954,7 @@ static int amdgpu_discovery_sysfs_recurse(struct 
amdgpu_device *adev)
res = kset_register(&ip_die_entry->ip_kset);
if (res) {
DRM_ERROR("Couldn't register ip_die_entry kset");
-   kfree(ip_die_entry);
+   kset_put(&ip_die_entry->ip_kset);
return res;
}
 
@@ -989,6 +989,7 @@ static int amdgpu_discovery_sysfs_init(struct amdgpu_device 
*adev)
res = kset_register(&adev->ip_top->die_kset);
if (res) {
DRM_ERROR("Couldn't register die_kset");
+   kset_put(&adev->ip_top->die_kset);
goto Err;
}
 
-- 
2.25.1

Re: [PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Greg KH

On Fri, Oct 21, 2022 at 01:29:31AM -0400, Luben Tuikov wrote:
> On 2022-10-20 22:20, Yang Yingliang wrote:
> > The previous discussion link:
> > https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/
> 
> The very first discussion on this was here:
> 
> https://www.spinics.net/lists/dri-devel/msg368077.html
> 
> Please use this link, and not the that one up there you which quoted above,
> and whose commit description is taken verbatim from the this link.
> 
> > 
> > kset_register() is currently used in some places without calling
> > kset_put() in error path, because the callers think it should be
> > kset internal thing to do, but the driver core can not know what
> > caller doing with that memory at times. The memory could be freed
> > both in kset_put() and error path of caller, if it is called in
> > kset_register().
> 
> As I explained in the link above, the reason there's
> a memory leak is that one cannot call kset_register() without
> the kset->kobj.name being set--kobj_add_internal() returns -EINVAL,
> in this case, i.e. kset_register() fails with -EINVAL.
> 
> Thus, the most common usage is something like this:
> 
>   kobj_set_name(&kset->kobj, format, ...);
>   kset->kobj.kset = parent_kset;
>   kset->kobj.ktype = ktype;
>   res = kset_register(kset);
> 
> So, what is being leaked, is the memory allocated in kobj_set_name(),
> by the common idiom shown above. This needs to be mentioned in
> the documentation, at least, in case, in the future this is absolved
> in kset_register() redesign, etc.

Based on this, can kset_register() just clean up from itself when an
error happens?  Ideally that would be the case, as the odds of a kset
being embedded in a larger structure is probably slim, but we would have
to search the tree to make sure.

thanks,

greg k-h

[PATCH 07/11] f2fs: fix possible memory leak in f2fs_init_sysfs()

2022-10-21 Thread Yang Yingliang

Inject fault while loading module, kset_register() may fail,
if it fails, but the refcount of kobject is not decreased to
0, the name allocated in kobject_set_name() is leaked. Fix
this by calling kset_put(), so that name can be freed in
callback function kobject_cleanup().

unreferenced object 0x888101b7cc80 (size 8):
  comm "modprobe", pid 252, jiffies 4294691378 (age 31.760s)
  hex dump (first 8 bytes):
66 32 66 73 00 88 ff ff  f2fs
  backtrace:
[<1db5b408>] __kmalloc_node_track_caller+0x44/0x1b0
[<2783a073>] kstrdup+0x3a/0x70
[] kstrdup_const+0x63/0x80
[<3e5cf8f7>] kvasprintf_const+0x149/0x180
[] kobject_set_name_vargs+0x56/0x150
[<44611660>] kobject_set_name+0xab/0xe0

Fixes: bf9e697ecd42 ("f2fs: expose features to sysfs entry")
Signed-off-by: Yang Yingliang 
Reviewed-by: Chao Yu 
---
 fs/f2fs/sysfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index df27afd71ef4..2ef7a48967be 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -1250,8 +1250,10 @@ int __init f2fs_init_sysfs(void)
kobject_set_name(&f2fs_kset.kobj, "f2fs");
f2fs_kset.kobj.parent = fs_kobj;
ret = kset_register(&f2fs_kset);
-   if (ret)
+   if (ret) {
+   kset_put(&f2fs_kset);
return ret;
+   }
 
ret = kobject_init_and_add(&f2fs_feat, &f2fs_feat_ktype,
   NULL, "features");
-- 
2.25.1

[PATCH 02/11] kset: add null pointer check in kset_put()

2022-10-21 Thread Yang Yingliang

kset_put() can be called from drivers, add null pointer
check to make it more robust.

Signed-off-by: Yang Yingliang 
---
 include/linux/kobject.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 57fb972fea05..e81de8ba41a2 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -195,7 +195,8 @@ static inline struct kset *kset_get(struct kset *k)
 
 static inline void kset_put(struct kset *k)
 {
-   kobject_put(&k->kobj);
+   if (k)
+   kobject_put(&k->kobj);
 }
 
 static inline const struct kobj_type *get_ktype(struct kobject *kobj)
-- 
2.25.1

[PATCH 00/11] fix memory leak while kset_register() fails

2022-10-21 Thread Yang Yingliang

The previous discussion link:
https://lore.kernel.org/lkml/0db486eb-6927-927e-3629-958f8f211...@huawei.com/T/

kset_register() is currently used in some places without calling
kset_put() in error path, because the callers think it should be
kset internal thing to do, but the driver core can not know what
caller doing with that memory at times. The memory could be freed
both in kset_put() and error path of caller, if it is called in
kset_register().

So make the function documentation more explicit about calling
kset_put() in the error path of caller first, so that people
have a chance to know what to do here, then fixes this leaks
by calling kset_put() from callers.

Liu Shixin (1):
  ubifs: Fix memory leak in ubifs_sysfs_init()

Yang Yingliang (10):
  kset: fix documentation for kset_register()
  kset: add null pointer check in kset_put()
  bus: fix possible memory leak in bus_register()
  kobject: fix possible memory leak in kset_create_and_add()
  class: fix possible memory leak in __class_register()
  firmware: qemu_fw_cfg: fix possible memory leak in
fw_cfg_build_symlink()
  f2fs: fix possible memory leak in f2fs_init_sysfs()
  erofs: fix possible memory leak in erofs_init_sysfs()
  ocfs2: possible memory leak in mlog_sys_init()
  drm/amdgpu/discovery: fix possible memory leak

 drivers/base/bus.c| 4 +++-
 drivers/base/class.c  | 6 ++
 drivers/firmware/qemu_fw_cfg.c| 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 5 +++--
 fs/erofs/sysfs.c  | 4 +++-
 fs/f2fs/sysfs.c   | 4 +++-
 fs/ocfs2/cluster/masklog.c| 7 ++-
 fs/ubifs/sysfs.c  | 2 ++
 include/linux/kobject.h   | 3 ++-
 lib/kobject.c | 5 -
 10 files changed, 33 insertions(+), 9 deletions(-)

-- 
2.25.1

[PATCH] drm/amd/display: Remove duplicate code for DCN314 DML calculation

2022-10-21 Thread Rafael Mendonca

This is an extension of commit fd3bc691fc7b ("drm/amd/display: Remove
duplicate code across dcn30 and dcn31"), which removed duplicate code for
the function CalculateBytePerPixelAnd256BBlockSizes() across dcn30 and
dcn31. At the time the aforementioned commit was introduced, support for
DCN 3.1.4 was still not merged. Thus, this deletes duplicate code for
CalculateBytePerPixelAnd256BBlockSizes(), that was introduced later in
DCN314, in favor of using the respective functionality from
'display_mode_vba_30.h'.

Additionally, by doing that, we also fix a duplicate argument issue
reported by coccinelle in 'display_rq_dlg_calc_314.c':

  static bool CalculateBytePerPixelAnd256BBlockSizes(...) {
...
} else if (SourcePixelFormat == dm_444_16 || SourcePixelFormat == 
dm_444_16) {
...
  }

Signed-off-by: Rafael Mendonca 
---
 .../dc/dml/dcn314/display_mode_vba_314.c  | 106 +-
 .../dc/dml/dcn314/display_rq_dlg_calc_314.c   |  91 +--
 2 files changed, 6 insertions(+), 191 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
index 0d12fd079cd6..6e43cd21a7d3 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
@@ -32,6 +32,7 @@
 #include "../display_mode_lib.h"
 #include "display_mode_vba_314.h"
 #include "../dml_inline_defs.h"
+#include "dml/dcn30/display_mode_vba_30.h"
 
 /*
  * NOTE:
@@ -90,17 +91,6 @@ typedef struct {
 #define BPP_INVALID 0
 #define BPP_BLENDED_PIPE 0x
 
-static bool CalculateBytePerPixelAnd256BBlockSizes(
-   enum source_format_class SourcePixelFormat,
-   enum dm_swizzle_mode SurfaceTiling,
-   unsigned int *BytePerPixelY,
-   unsigned int *BytePerPixelC,
-   double *BytePerPixelDETY,
-   double *BytePerPixelDETC,
-   unsigned int *BlockHeight256BytesY,
-   unsigned int *BlockHeight256BytesC,
-   unsigned int *BlockWidth256BytesY,
-   unsigned int *BlockWidth256BytesC);
 static void DisplayPipeConfiguration(struct display_mode_lib *mode_lib);
 static void 
DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerformanceCalculation(struct
 display_mode_lib *mode_lib);
 static unsigned int dscceComputeDelay(
@@ -2178,7 +2168,7 @@ static void 
DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerforman
DTRACE("   return_bus_bw  = %f", v->ReturnBW);
 
for (k = 0; k < v->NumberOfActivePlanes; ++k) {
-   CalculateBytePerPixelAnd256BBlockSizes(
+   dml30_CalculateBytePerPixelAnd256BBlockSizes(
v->SourcePixelFormat[k],
v->SurfaceTiling[k],
&v->BytePerPixelY[k],
@@ -3317,7 +3307,7 @@ static void DisplayPipeConfiguration(struct 
display_mode_lib *mode_lib)
 
for (k = 0; k < v->NumberOfActivePlanes; ++k) {
 
-   CalculateBytePerPixelAnd256BBlockSizes(
+   dml30_CalculateBytePerPixelAnd256BBlockSizes(
v->SourcePixelFormat[k],
v->SurfaceTiling[k],
&BytePerPixY[k],
@@ -3371,94 +3361,6 @@ static void DisplayPipeConfiguration(struct 
display_mode_lib *mode_lib)
&dummysinglestring);
 }
 
-static bool CalculateBytePerPixelAnd256BBlockSizes(
-   enum source_format_class SourcePixelFormat,
-   enum dm_swizzle_mode SurfaceTiling,
-   unsigned int *BytePerPixelY,
-   unsigned int *BytePerPixelC,
-   double *BytePerPixelDETY,
-   double *BytePerPixelDETC,
-   unsigned int *BlockHeight256BytesY,
-   unsigned int *BlockHeight256BytesC,
-   unsigned int *BlockWidth256BytesY,
-   unsigned int *BlockWidth256BytesC)
-{
-   if (SourcePixelFormat == dm_444_64) {
-   *BytePerPixelDETY = 8;
-   *BytePerPixelDETC = 0;
-   *BytePerPixelY = 8;
-   *BytePerPixelC = 0;
-   } else if (SourcePixelFormat == dm_444_32 || SourcePixelFormat == 
dm_rgbe) {
-   *BytePerPixelDETY = 4;
-   *BytePerPixelDETC = 0;
-   *BytePerPixelY = 4;
-   *BytePerPixelC = 0;
-   } else if (SourcePixelFormat == dm_444_16) {
-   *BytePerPixelDETY = 2;
-   *BytePerPixelDETC = 0;
-   *BytePerPixelY = 2;
-   *BytePerPixelC = 0;
-   } else if (SourcePixelFormat == dm_444_8) {
-   *BytePerPixelDETY = 1;
-   *BytePerPixelDETC = 0;
-   *BytePerPixelY = 1;
-   *BytePerPixelC = 0;
-   } else if (SourcePixelFormat == dm_rgbe_alpha) {
-   *BytePerPixelDETY = 4;
-

[PATCH 04/11] kobject: fix possible memory leak in kset_create_and_add()

2022-10-21 Thread Yang Yingliang

Inject fault while loading module (e.g. qemu_fw_cfg.ko), kset_register()
may fail in kset_create_and_add(), if it fails, but the refcount of kobject
is not decreased to 0, the name allocated in kset_create() is leaked. To fix
this by calling kset_put(), so that name can be freed in callback function
kobject_cleanup() and kset can be freed in kset_release().

unreferenced object 0x888103cc8c08 (size 8):
  comm "modprobe", pid 508, jiffies 4294915182 (age 120.020s)
  hex dump (first 8 bytes):
62 79 5f 6e 61 6d 65 00  by_name.
  backtrace:
[<572f97f9>] __kmalloc_track_caller+0x1ae/0x320
[] kstrdup+0x3a/0x70
[<1cd0d05e>] kstrdup_const+0x68/0x80
[] kvasprintf_const+0x10b/0x190
[<88f2b8df>] kobject_set_name_vargs+0x56/0x150
[<3f8aca68>] kobject_set_name+0xab/0xe0
[<249f7816>] kset_create_and_add+0x72/0x200

Fixes: b727c702896f ("kset: add kset_create_and_add function")
Signed-off-by: Yang Yingliang 
---
 lib/kobject.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/kobject.c b/lib/kobject.c
index 6da04353d974..e77f37200876 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -985,7 +985,7 @@ struct kset *kset_create_and_add(const char *name,
return NULL;
error = kset_register(kset);
if (error) {
-   kfree(kset);
+   kset_put(kset);
return NULL;
}
return kset;
-- 
2.25.1

[PATCH] drm/amd/display: Revert logic for plane modifiers

2022-10-21 Thread Joaquín Ignacio Aramendía

This file was split in commit 5d945cbcd4b16a29d6470a80dfb19738f9a4319f
("drm/amd/display: Create a file dedicated to planes") and the logic in
dm_plane_format_mod_supported() function got changed by a switch logic.
That change broke drm_plane modifiers setting on series 5000 APUs
(tested on OXP mini AMD 5800U and HP Dev One 5850U PRO)
leading to Gamescope not working as reported on GitHub[1]

To reproduce the issue, enter a TTY and run:

$ gamescope -- vkcube

With said commit applied it will abort. This one restores the old logic,
fixing the issue that affects Gamescope.

[1](https://github.com/Plagman/gamescope/issues/624)

Signed-off-by: Joaquín Ignacio Aramendía 
---
 .../amd/display/amdgpu_dm/amdgpu_dm_plane.c   | 51 +++
 1 file changed, 8 insertions(+), 43 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
index dfd3be49eac8..1d47d2d69781 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
@@ -1369,7 +1369,8 @@ static bool dm_plane_format_mod_supported(struct 
drm_plane *plane,
 {
struct amdgpu_device *adev = drm_to_adev(plane->dev);
const struct drm_format_info *info = drm_format_info(format);
-   struct hw_asic_id asic_id = adev->dm.dc->ctx->asic_id;
+
+   int i;

enum dm_micro_swizzle microtile = modifier_gfx9_swizzle_mode(modifier) 
& 3;

@@ -1386,49 +1387,13 @@ static bool dm_plane_format_mod_supported(struct 
drm_plane *plane,
return true;
}

-   /* check if swizzle mode is supported by this version of DCN */
-   switch (asic_id.chip_family) {
-   case FAMILY_SI:
-   case FAMILY_CI:
-   case FAMILY_KV:
-   case FAMILY_CZ:
-   case FAMILY_VI:
-   /* asics before AI does not have modifier support */
-   return false;
-   case FAMILY_AI:
-   case FAMILY_RV:
-   case FAMILY_NV:
-   case FAMILY_VGH:
-   case FAMILY_YELLOW_CARP:
-   case AMDGPU_FAMILY_GC_10_3_6:
-   case AMDGPU_FAMILY_GC_10_3_7:
-   switch (AMD_FMT_MOD_GET(TILE, modifier)) {
-   case AMD_FMT_MOD_TILE_GFX9_64K_R_X:
-   case AMD_FMT_MOD_TILE_GFX9_64K_D_X:
-   case AMD_FMT_MOD_TILE_GFX9_64K_S_X:
-   case AMD_FMT_MOD_TILE_GFX9_64K_D:
-   return true;
-   default:
-   return false;
-   }
-   break;
-   case AMDGPU_FAMILY_GC_11_0_0:
-   case AMDGPU_FAMILY_GC_11_0_1:
-   switch (AMD_FMT_MOD_GET(TILE, modifier)) {
-   case AMD_FMT_MOD_TILE_GFX11_256K_R_X:
-   case AMD_FMT_MOD_TILE_GFX9_64K_R_X:
-   case AMD_FMT_MOD_TILE_GFX9_64K_D_X:
-   case AMD_FMT_MOD_TILE_GFX9_64K_S_X:
-   case AMD_FMT_MOD_TILE_GFX9_64K_D:
-   return true;
-   default:
-   return false;
-   }
-   break;
-   default:
-   ASSERT(0); /* Unknown asic */
-   break;
+   /* Check that the modifier is on the list of the plane's supported 
modifiers. */
+   for (i = 0; i < plane->modifier_count; i++) {
+   if (modifier == plane->modifiers[i])
+   break;
}
+   if (i == plane->modifier_count)
+   return false;

/*
 * For D swizzle the canonical modifier depends on the bpp, so check
--
2.38.1

[PATCH 03/11] bus: fix possible memory leak in bus_register()

2022-10-21 Thread Yang Yingliang

Inject fault while loading module (e.g. edac_core.ko), kset_register()
may fail in bus_register(), if it fails, but the refcount of kobject is
not decreased to 0, the name allocated in kobject_set_name() is leaked.
To fix this by calling kset_put(), so that name can be freed in callback
function kobject_cleanup().

unreferenced object 0x888103bddb68 (size 8):
  comm "systemd-udevd", pid 341, jiffies 4294903262 (age 42.212s)
  hex dump (first 8 bytes):
65 64 61 63 00 00 00 00  edac
  backtrace:
[<9e31d566>] __kmalloc_track_caller+0x1ae/0x320
[] kstrdup+0x3a/0x70
[<3d0ec369>] kstrdup_const+0x68/0x80
[<8e5c3b20>] kvasprintf_const+0x10b/0x190
[] kobject_set_name_vargs+0x56/0x150
[<0df9278c>] kobject_set_name+0xab/0xe0
[] bus_register+0x132/0x350
[<7d91c2e5>] subsys_register+0x23/0x220

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Yang Yingliang 
---
 drivers/base/bus.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 7ca47e5b3c1f..301b5330f9d8 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -804,8 +804,10 @@ int bus_register(struct bus_type *bus)
priv->drivers_autoprobe = 1;
 
retval = kset_register(&priv->subsys);
-   if (retval)
+   if (retval) {
+   kset_put(&priv->subsys);
goto out;
+   }
 
retval = bus_create_file(bus, &bus_attr_uevent);
if (retval)
-- 
2.25.1

58 matches

Mail list logo