RE: [PATCH 06/18] drm/amdgpu/sriov:fix memory leak after gpu reset

2017-09-19 Thread Liu, Monk
Oh, I see your point, but that actually presents for a cleanup patch, and mine 
is to add a condition to fix memory leak, I think they different purpose and 
should be separated,

I can add one more patch to cleanup it with that "create_bo_kenel" to make code 
more tight and clean

BR Monk

-Original Message-
From: Koenig, Christian 
Sent: 2017年9月18日 19:35
To: Liu, Monk <monk@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 06/18] drm/amdgpu/sriov:fix memory leak after gpu reset

Am 18.09.2017 um 12:47 schrieb Liu, Monk:
> I didn't get your point... how could bo_create_kernel solve my issue ?

It doesn't solve the underlying issue, you just need less code for your 
workaround.

With bo_create_kernel you can do create/pin/kmap in just one function call.

>
> The thing here is during gpu reset we invoke hw_init for every hw 
> component, and by design hw_init shouldn't doing anything software 
> related, thus the BO allocating in hw_init is wrong,

Yeah, but your patch doesn't fix that either as far as I can see.

> Even switch to bo_create_kernel won't address the issue ...

See the implementation of bo_create_kernel():
> if (!*bo_ptr) {
> r = amdgpu_bo_create(adev, size, align, true, domain,

> }

> r = amdgpu_bo_pin(*bo_ptr, domain, gpu_addr);
...
> if (cpu_addr) {
> r = amdgpu_bo_kmap(*bo_ptr, cpu_addr);
...
> }

Creating is actually optional, but the function always pins the BO once more 
and figures out it's CPU address.

As far as I can see that should solve your problem for now.

Christian.


>
>
> BR Monk
>
> -Original Message-
> From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
> Sent: 2017年9月18日 17:13
> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 06/18] drm/amdgpu/sriov:fix memory leak after gpu 
> reset
>
> Am 18.09.2017 um 08:11 schrieb Monk Liu:
>> doing gpu reset will rerun all hw_init and thus ucode_init_bo is 
>> invoked again, so we need to skip the fw_buf allocation during sriov 
>> gpu reset to avoid memory leak.
>>
>> Change-Id: I31131eda1bd45ea2f5bdc50c5da5fc5a9fe9027d
>> Signed-off-by: Monk Liu <monk@amd.com>
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  3 ++
>>drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 64 
>> +++
>>2 files changed, 35 insertions(+), 32 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 6ff2959..3d0c633 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1185,6 +1185,9 @@ struct amdgpu_firmware {
>>
>>  /* gpu info firmware data pointer */
>>  const struct firmware *gpu_info_fw;
>> +
>> +void *fw_buf_ptr;
>> +uint64_t fw_buf_mc;
>>};
>>
>>/*
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
>> index f306374..6564902 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
>> @@ -360,8 +360,6 @@ static int amdgpu_ucode_patch_jt(struct 
>> amdgpu_firmware_info *ucode,
>>int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
>>{
>>  struct amdgpu_bo **bo = >firmware.fw_buf;
>> -uint64_t fw_mc_addr;
>> -void *fw_buf_ptr = NULL;
>>  uint64_t fw_offset = 0;
>>  int i, err;
>>  struct amdgpu_firmware_info *ucode = NULL; @@ -372,37 +370,39 @@ 
>> int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
>>  return 0;
>>  }
>>
>> -err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,
>> -amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM 
>> : AMDGPU_GEM_DOMAIN_GTT,
>> -AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
>> -NULL, NULL, 0, bo);
>> -if (err) {
>> -dev_err(adev->dev, "(%d) Firmware buffer allocate failed\n", 
>> err);
>> -goto failed;
>> -}
>> +if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {
> Instead of all this better use amdgpu_bo_create_kernel(), this should already 
> include most of the handling necessary here.
>
> Christian.
>
>> +err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, 
>> true,
>> +amdgpu_sriov_vf(adev) ? 
>> AMDGPU_GEM_DOMAIN_VRAM : AMD

Re: [PATCH 06/18] drm/amdgpu/sriov:fix memory leak after gpu reset

2017-09-18 Thread Christian König

Am 18.09.2017 um 12:47 schrieb Liu, Monk:

I didn't get your point... how could bo_create_kernel solve my issue ?


It doesn't solve the underlying issue, you just need less code for your 
workaround.


With bo_create_kernel you can do create/pin/kmap in just one function call.



The thing here is during gpu reset we invoke hw_init for every hw component, 
and by design hw_init shouldn't doing anything software related, thus the BO 
allocating in hw_init is wrong,


Yeah, but your patch doesn't fix that either as far as I can see.


Even switch to bo_create_kernel won't address the issue ...


See the implementation of bo_create_kernel():

if (!*bo_ptr) {
r = amdgpu_bo_create(adev, size, align, true, domain,



}



r = amdgpu_bo_pin(*bo_ptr, domain, gpu_addr);

...

if (cpu_addr) {
r = amdgpu_bo_kmap(*bo_ptr, cpu_addr);

...

}


Creating is actually optional, but the function always pins the BO once 
more and figures out it's CPU address.


As far as I can see that should solve your problem for now.

Christian.





BR Monk

-Original Message-
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
Sent: 2017年9月18日 17:13
To: Liu, Monk <monk@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 06/18] drm/amdgpu/sriov:fix memory leak after gpu reset

Am 18.09.2017 um 08:11 schrieb Monk Liu:

doing gpu reset will rerun all hw_init and thus ucode_init_bo is
invoked again, so we need to skip the fw_buf allocation during sriov
gpu reset to avoid memory leak.

Change-Id: I31131eda1bd45ea2f5bdc50c5da5fc5a9fe9027d
Signed-off-by: Monk Liu <monk@amd.com>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  3 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 64 
+++
   2 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 6ff2959..3d0c633 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1185,6 +1185,9 @@ struct amdgpu_firmware {
   
   	/* gpu info firmware data pointer */

const struct firmware *gpu_info_fw;
+
+   void *fw_buf_ptr;
+   uint64_t fw_buf_mc;
   };
   
   /*

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index f306374..6564902 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
@@ -360,8 +360,6 @@ static int amdgpu_ucode_patch_jt(struct 
amdgpu_firmware_info *ucode,
   int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
   {
struct amdgpu_bo **bo = >firmware.fw_buf;
-   uint64_t fw_mc_addr;
-   void *fw_buf_ptr = NULL;
uint64_t fw_offset = 0;
int i, err;
struct amdgpu_firmware_info *ucode = NULL; @@ -372,37 +370,39 @@
int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
return 0;
}
   
-	err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,

-   amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM 
: AMDGPU_GEM_DOMAIN_GTT,
-   AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
-   NULL, NULL, 0, bo);
-   if (err) {
-   dev_err(adev->dev, "(%d) Firmware buffer allocate failed\n", 
err);
-   goto failed;
-   }
+   if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {

Instead of all this better use amdgpu_bo_create_kernel(), this should already 
include most of the handling necessary here.

Christian.


+   err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, 
true,
+   amdgpu_sriov_vf(adev) ? 
AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
+   AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
+   NULL, NULL, 0, bo);
+   if (err) {
+   dev_err(adev->dev, "(%d) Firmware buffer allocate 
failed\n", err);
+   goto failed;
+   }
   
-	err = amdgpu_bo_reserve(*bo, false);

-   if (err) {
-   dev_err(adev->dev, "(%d) Firmware buffer reserve failed\n", 
err);
-   goto failed_reserve;
-   }
+   err = amdgpu_bo_reserve(*bo, false);
+   if (err) {
+   dev_err(adev->dev, "(%d) Firmware buffer reserve 
failed\n", err);
+   goto failed_reserve;
+   }
   
-	err = amdgpu_bo_pin(*bo, amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,

-   _mc_addr);
-   if (err) {
-   dev_err(adev->dev, "(%d) Firmware buffer pin failed\n", err);
-   goto failed_pin;
-   }
+   err = a

RE: [PATCH 06/18] drm/amdgpu/sriov:fix memory leak after gpu reset

2017-09-18 Thread Liu, Monk
I didn't get your point... how could bo_create_kernel solve my issue ?

The thing here is during gpu reset we invoke hw_init for every hw component, 
and by design hw_init shouldn't doing anything software related, thus the BO 
allocating in hw_init is wrong,

Even switch to bo_create_kernel won't address the issue ...


BR Monk

-Original Message-
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com] 
Sent: 2017年9月18日 17:13
To: Liu, Monk <monk@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 06/18] drm/amdgpu/sriov:fix memory leak after gpu reset

Am 18.09.2017 um 08:11 schrieb Monk Liu:
> doing gpu reset will rerun all hw_init and thus ucode_init_bo is 
> invoked again, so we need to skip the fw_buf allocation during sriov 
> gpu reset to avoid memory leak.
>
> Change-Id: I31131eda1bd45ea2f5bdc50c5da5fc5a9fe9027d
> Signed-off-by: Monk Liu <monk@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  3 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 64 
> +++
>   2 files changed, 35 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 6ff2959..3d0c633 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1185,6 +1185,9 @@ struct amdgpu_firmware {
>   
>   /* gpu info firmware data pointer */
>   const struct firmware *gpu_info_fw;
> +
> + void *fw_buf_ptr;
> + uint64_t fw_buf_mc;
>   };
>   
>   /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> index f306374..6564902 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> @@ -360,8 +360,6 @@ static int amdgpu_ucode_patch_jt(struct 
> amdgpu_firmware_info *ucode,
>   int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
>   {
>   struct amdgpu_bo **bo = >firmware.fw_buf;
> - uint64_t fw_mc_addr;
> - void *fw_buf_ptr = NULL;
>   uint64_t fw_offset = 0;
>   int i, err;
>   struct amdgpu_firmware_info *ucode = NULL; @@ -372,37 +370,39 @@ 
> int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
>   return 0;
>   }
>   
> - err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,
> - amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM 
> : AMDGPU_GEM_DOMAIN_GTT,
> - AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
> - NULL, NULL, 0, bo);
> - if (err) {
> - dev_err(adev->dev, "(%d) Firmware buffer allocate failed\n", 
> err);
> - goto failed;
> - }
> + if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {

Instead of all this better use amdgpu_bo_create_kernel(), this should already 
include most of the handling necessary here.

Christian.

> + err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, 
> true,
> + amdgpu_sriov_vf(adev) ? 
> AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
> + AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
> + NULL, NULL, 0, bo);
> + if (err) {
> + dev_err(adev->dev, "(%d) Firmware buffer allocate 
> failed\n", err);
> + goto failed;
> + }
>   
> - err = amdgpu_bo_reserve(*bo, false);
> - if (err) {
> - dev_err(adev->dev, "(%d) Firmware buffer reserve failed\n", 
> err);
> - goto failed_reserve;
> - }
> + err = amdgpu_bo_reserve(*bo, false);
> + if (err) {
> + dev_err(adev->dev, "(%d) Firmware buffer reserve 
> failed\n", err);
> + goto failed_reserve;
> + }
>   
> - err = amdgpu_bo_pin(*bo, amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM 
> : AMDGPU_GEM_DOMAIN_GTT,
> - _mc_addr);
> - if (err) {
> - dev_err(adev->dev, "(%d) Firmware buffer pin failed\n", err);
> - goto failed_pin;
> - }
> + err = amdgpu_bo_pin(*bo, amdgpu_sriov_vf(adev) ? 
> AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
> + >firmware.fw_buf_mc);
> + if (err) {
> + dev_err(adev->dev, "(%d) Firmware buffer pin failed\n", 
> err);
> + goto failed_pin;
> + }
>   
> - err = amdgpu_bo_kmap(*bo, _buf_ptr);
> - if (err

Re: [PATCH 06/18] drm/amdgpu/sriov:fix memory leak after gpu reset

2017-09-18 Thread Christian König

Am 18.09.2017 um 08:11 schrieb Monk Liu:

doing gpu reset will rerun all hw_init and thus
ucode_init_bo is invoked again, so we need to skip
the fw_buf allocation during sriov gpu reset to avoid
memory leak.

Change-Id: I31131eda1bd45ea2f5bdc50c5da5fc5a9fe9027d
Signed-off-by: Monk Liu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  3 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 64 +++
  2 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 6ff2959..3d0c633 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1185,6 +1185,9 @@ struct amdgpu_firmware {
  
  	/* gpu info firmware data pointer */

const struct firmware *gpu_info_fw;
+
+   void *fw_buf_ptr;
+   uint64_t fw_buf_mc;
  };
  
  /*

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index f306374..6564902 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
@@ -360,8 +360,6 @@ static int amdgpu_ucode_patch_jt(struct 
amdgpu_firmware_info *ucode,
  int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
  {
struct amdgpu_bo **bo = >firmware.fw_buf;
-   uint64_t fw_mc_addr;
-   void *fw_buf_ptr = NULL;
uint64_t fw_offset = 0;
int i, err;
struct amdgpu_firmware_info *ucode = NULL;
@@ -372,37 +370,39 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
return 0;
}
  
-	err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,

-   amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM 
: AMDGPU_GEM_DOMAIN_GTT,
-   AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
-   NULL, NULL, 0, bo);
-   if (err) {
-   dev_err(adev->dev, "(%d) Firmware buffer allocate failed\n", 
err);
-   goto failed;
-   }
+   if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {


Instead of all this better use amdgpu_bo_create_kernel(), this should 
already include most of the handling necessary here.


Christian.


+   err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, 
true,
+   amdgpu_sriov_vf(adev) ? 
AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
+   AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
+   NULL, NULL, 0, bo);
+   if (err) {
+   dev_err(adev->dev, "(%d) Firmware buffer allocate 
failed\n", err);
+   goto failed;
+   }
  
-	err = amdgpu_bo_reserve(*bo, false);

-   if (err) {
-   dev_err(adev->dev, "(%d) Firmware buffer reserve failed\n", 
err);
-   goto failed_reserve;
-   }
+   err = amdgpu_bo_reserve(*bo, false);
+   if (err) {
+   dev_err(adev->dev, "(%d) Firmware buffer reserve 
failed\n", err);
+   goto failed_reserve;
+   }
  
-	err = amdgpu_bo_pin(*bo, amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,

-   _mc_addr);
-   if (err) {
-   dev_err(adev->dev, "(%d) Firmware buffer pin failed\n", err);
-   goto failed_pin;
-   }
+   err = amdgpu_bo_pin(*bo, amdgpu_sriov_vf(adev) ? 
AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
+   >firmware.fw_buf_mc);
+   if (err) {
+   dev_err(adev->dev, "(%d) Firmware buffer pin failed\n", 
err);
+   goto failed_pin;
+   }
  
-	err = amdgpu_bo_kmap(*bo, _buf_ptr);

-   if (err) {
-   dev_err(adev->dev, "(%d) Firmware buffer kmap failed\n", err);
-   goto failed_kmap;
-   }
+   err = amdgpu_bo_kmap(*bo, >firmware.fw_buf_ptr);
+   if (err) {
+   dev_err(adev->dev, "(%d) Firmware buffer kmap 
failed\n", err);
+   goto failed_kmap;
+   }
  
-	amdgpu_bo_unreserve(*bo);

+   amdgpu_bo_unreserve(*bo);
+   }
  
-	memset(fw_buf_ptr, 0, adev->firmware.fw_size);

+   memset(adev->firmware.fw_buf_ptr, 0, adev->firmware.fw_size);
  
  	/*

 * if SMU loaded firmware, it needn't add SMC, UVD, and VCE
@@ -421,14 +421,14 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
ucode = >firmware.ucode[i];
if (ucode->fw) {
header = (const struct common_firmware_header 
*)ucode->fw->data;
-   amdgpu_ucode_init_single_fw(adev, ucode, fw_mc_addr + 
fw_offset,
-   (void *)((uint8_t 
*)fw_buf_ptr + fw_offset));
+