Re: [PATCH v3] drm/amdgpu: user pages array memory leak fix

2019-10-13 Thread Joe Barnett
Confirming that v3 patch still fixes the bug.

Thanks,
-Joe

On Fri, Oct 11, 2019 at 7:36 AM Yang, Philip  wrote:

> user_pages array should always be freed after validation regardless if
> user pages are changed after bo is created because with HMM change parse
> bo always allocate user pages array to get user pages for userptr bo.
>
> v2: remove unused local variable and amend commit
>
> v3: add back get user pages in gem_userptr_ioctl, to detect application
> bug where an userptr VMA is not ananymous memory and reject it.
>
> Bugzilla: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1844962
>
> Signed-off-by: Philip Yang 
> Tested-by: Joe Barnett 
> Reviewed-by: Christian König 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index c18a153b3d2a..e7b39daa22f6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -476,7 +476,6 @@ static int amdgpu_cs_list_validate(struct
> amdgpu_cs_parser *p,
>
> list_for_each_entry(lobj, validated, tv.head) {
> struct amdgpu_bo *bo = ttm_to_amdgpu_bo(lobj->tv.bo);
> -   bool binding_userptr = false;
> struct mm_struct *usermm;
>
> usermm = amdgpu_ttm_tt_get_usermm(bo->tbo.ttm);
> @@ -493,14 +492,13 @@ static int amdgpu_cs_list_validate(struct
> amdgpu_cs_parser *p,
>
> amdgpu_ttm_tt_set_user_pages(bo->tbo.ttm,
>  lobj->user_pages);
> -   binding_userptr = true;
> }
>
> r = amdgpu_cs_validate(p, bo);
> if (r)
> return r;
>
> -   if (binding_userptr) {
> +   if (lobj->user_pages) {
> kvfree(lobj->user_pages);
> lobj->user_pages = NULL;
> }
> --
> 2.17.1
>
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: user pages array memory leak fix

2019-10-10 Thread Joe Barnett
Thanks for looking at this Christian.  Let me know if there's anything I
can do to help.

In the meantime, is there a workaround to avoid the memory leak other than
using a kernel from before the HMM change?

Thanks,
-Joe

On Fri, Oct 4, 2019 at 8:02 AM Koenig, Christian 
wrote:

> Hi Philip,
>
> Am 04.10.19 um 15:40 schrieb Yang, Philip:
> > Thanks Joe for the test, I will add your Tested-by.
> >
> > Hi Christian,
> >
> > May you help review? The change removes the get user pages from
> > gem_userptr_ioctl, this was done if flags AMDGPU_GEM_USERPTR_VALIDATE is
> > set, and delay the get user pages to amdgpu_cs_parser_bos, and check if
> > user pages are invalidated when amdgpu_cs_submit. I don't find issue for
> > overnight test, but not sure if there is potential side effect.
>
> Yeah, seen that.
>
> The AMDGPU_GEM_USERPTR_VALIDATE was explicitly added to cause a
> validation during BO creation because of some very stupid applications.
>
> I didn't wanted to reject that without offering an alternative, but we
> seriously can't do this or it would break Vulkan/OpenGL.
>
> I need more time to take a closer look,
> Christian.
>
> >
> > Thanks,
> > Philip
> >
> > On 2019-10-03 3:44 p.m., Yang, Philip wrote:
> >> user_pages array should always be freed after validation regardless if
> >> user pages are changed after bo is created because with HMM change parse
> >> bo always allocate user pages array to get user pages for userptr bo.
> >>
> >> Don't need to get user pages while creating uerptr bo because user pages
> >> will only be used while validating after parsing userptr bo.
> >>
> >> Bugzilla: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1844962
> >>
> >> v2: remove unused local variable and amend commit
> >>
> >> Signed-off-by: Philip Yang 
> >> ---
> >>drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  4 +---
> >>drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 23 +--
> >>2 files changed, 2 insertions(+), 25 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >> index 49b767b7238f..961186e7113e 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >> @@ -474,7 +474,6 @@ static int amdgpu_cs_list_validate(struct
> amdgpu_cs_parser *p,
> >>
> >>  list_for_each_entry(lobj, validated, tv.head) {
> >>  struct amdgpu_bo *bo = ttm_to_amdgpu_bo(lobj->tv.bo);
> >> -bool binding_userptr = false;
> >>  struct mm_struct *usermm;
> >>
> >>  usermm = amdgpu_ttm_tt_get_usermm(bo->tbo.ttm);
> >> @@ -491,14 +490,13 @@ static int amdgpu_cs_list_validate(struct
> amdgpu_cs_parser *p,
> >>
> >>  amdgpu_ttm_tt_set_user_pages(bo->tbo.ttm,
> >>   lobj->user_pages);
> >> -binding_userptr = true;
> >>  }
> >>
> >>  r = amdgpu_cs_validate(p, bo);
> >>  if (r)
> >>  return r;
> >>
> >> -if (binding_userptr) {
> >> +if (lobj->user_pages) {
> >>  kvfree(lobj->user_pages);
> >>  lobj->user_pages = NULL;
> >>  }
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >> index a828e3d0bfbd..3ccd61d69964 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >> @@ -283,7 +283,6 @@ int amdgpu_gem_create_ioctl(struct drm_device *dev,
> void *data,
> >>int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data,
> >>   struct drm_file *filp)
> >>{
> >> -struct ttm_operation_ctx ctx = { true, false };
> >>  struct amdgpu_device *adev = dev->dev_private;
> >>  struct drm_amdgpu_gem_userptr *args = data;
> >>  struct drm_gem_object *gobj;
> >> @@ -326,32 +325,12 @@ int amdgpu_gem_userptr_ioctl(struct drm_device
> *dev, void *data,
> >>  goto release_object;
> >>  }
> >>
> >> -if (args->flags & AMDGPU_GEM_USERPTR_VALIDATE) {
> >> -r = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages);
> >> -if (r)
> >> -goto release_object;
> >> -
> >> -r = amdgpu_bo_reserve(bo, true);
> >> -if (r)
> >> -goto user_pages_done;
> >> -
> >> -amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
> >> -r = ttm_bo_validate(>tbo, >placement, );
> >> -amdgpu_bo_unreserve(bo);
> >> -if (r)
> >> -goto user_pages_done;
> >> -}
> >> -
> >>  r = drm_gem_handle_create(filp, gobj, );
> >>  if (r)
> >> -goto user_pages_done;
> >> +goto release_object;
> >>
> >>  args->handle = handle;
> >>
> >> -user_pages_done:
> >> -if (args->flags & AMDGPU_GEM_USERPTR_VALIDATE)
> >> -  

Re: [PATCH] drm/amdgpu: user pages array memory leak fix

2019-10-03 Thread Joe Barnett
I've tested applying v2 of this patch against a v5.3 tagged kernel and it
appears to fix the issue I reported.

Thanks,
-Joe

On Thu, Oct 3, 2019 at 12:07 PM Yang, Philip  wrote:

> user_pages array should be freed regardless if user pages are
> invalidated after bo is created because HMM change to always allocate
> user pages array to get user pages while parsing user page bo.
>
> Don't need to to get user pages while creating bo because user pages
> will only be used after parsing user page bo.
>
> Bugzilla: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1844962
>
> Signed-off-by: Philip Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 23 +--
>  2 files changed, 2 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 49b767b7238f..e861de259def 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -498,7 +498,7 @@ static int amdgpu_cs_list_validate(struct
> amdgpu_cs_parser *p,
> if (r)
> return r;
>
> -   if (binding_userptr) {
> +   if (lobj->user_pages) {
> kvfree(lobj->user_pages);
> lobj->user_pages = NULL;
> }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index a828e3d0bfbd..3ccd61d69964 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -283,7 +283,6 @@ int amdgpu_gem_create_ioctl(struct drm_device *dev,
> void *data,
>  int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data,
>  struct drm_file *filp)
>  {
> -   struct ttm_operation_ctx ctx = { true, false };
> struct amdgpu_device *adev = dev->dev_private;
> struct drm_amdgpu_gem_userptr *args = data;
> struct drm_gem_object *gobj;
> @@ -326,32 +325,12 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev,
> void *data,
> goto release_object;
> }
>
> -   if (args->flags & AMDGPU_GEM_USERPTR_VALIDATE) {
> -   r = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages);
> -   if (r)
> -   goto release_object;
> -
> -   r = amdgpu_bo_reserve(bo, true);
> -   if (r)
> -   goto user_pages_done;
> -
> -   amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
> -   r = ttm_bo_validate(>tbo, >placement, );
> -   amdgpu_bo_unreserve(bo);
> -   if (r)
> -   goto user_pages_done;
> -   }
> -
> r = drm_gem_handle_create(filp, gobj, );
> if (r)
> -   goto user_pages_done;
> +   goto release_object;
>
> args->handle = handle;
>
> -user_pages_done:
> -   if (args->flags & AMDGPU_GEM_USERPTR_VALIDATE)
> -   amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
> -
>  release_object:
> drm_gem_object_put_unlocked(gobj);
>
> --
> 2.17.1
>
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

HMM related? memory leakage

2019-10-01 Thread Joe Barnett
In the development version of ubuntu, using a 5.3 kernel, running the
dolphin emulator appears to leak memory (there may be other apps that
trigger the same issue, but haven't run into them).  The "used" memory
reported by top grows until the app exits, and does not get freed at that
time.  This is on a dell xps 2-in-1 with hybrid intel/amd-vega-m graphics,
and DRI_PRIME=1.  Some more details at downstream bug report:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1844962

Running a kernel bisection to find the offending commit shows this:

899fbde1464639e3d12eaffdad8481a59b367fcb is the first bad commit
commit 899fbde1464639e3d12eaffdad8481a59b367fcb
Author: Philip Yang 
Date:   Thu Dec 13 15:35:28 2018 -0500

drm/amdgpu: replace get_user_pages with HMM mirror helpers

Use HMM helper function hmm_vma_fault() to get physical pages backing
userptr and start CPU page table update track of those pages. Then use
hmm_vma_range_done() to check if those pages are updated before
amdgpu_cs_submit for gfx or before user queues are resumed for kfd.

If userptr pages are updated, for gfx, amdgpu_cs_ioctl will restart
from scratch, for kfd, restore worker is rescheduled to retry.

HMM simplify the CPU page table concurrent update check, so remove
guptasklock, mmu_invalidations, last_set_pages fields from
amdgpu_ttm_tt struct.

HMM does not pin the page (increase page ref count), so remove related
operations like release_pages(), put_page(), mark_page_dirty().

Signed-off-by: Philip Yang 
Reviewed-by: Felix Kuehling 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 

:04 04 0c9f0e2e82e5e4d2d3a4c0daea22eb911244b771
fdcdc7c80f5383486962edf4561e205b55bd8c21 M drivers


$ git bisect log
# bad: [f74c2bb98776e2de508f4d607cd519873065118e] Linux 5.3-rc8
# good: [1c163f4c7b3f621efff9b28a47abb36f7378d783] Linux 5.0
git bisect start 'v5.3-rc8' 'v5.0'
# good: [a2d635decbfa9c1e4ae15cb05b68b2559f7f827c] Merge tag
'drm-next-2019-05-09' of git://anongit.freedesktop.org/drm/drm
git bisect good a2d635decbfa9c1e4ae15cb05b68b2559f7f827c
# good: [a2d635decbfa9c1e4ae15cb05b68b2559f7f827c] Merge tag
'drm-next-2019-05-09' of git://anongit.freedesktop.org/drm/drm
git bisect good a2d635decbfa9c1e4ae15cb05b68b2559f7f827c
# good: [8f6ccf6159aed1f04c6d179f61f6fb2691261e84] Merge tag 'clone3-v5.3'
of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
git bisect good 8f6ccf6159aed1f04c6d179f61f6fb2691261e84
# good: [8f6ccf6159aed1f04c6d179f61f6fb2691261e84] Merge tag 'clone3-v5.3'
of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
git bisect good 8f6ccf6159aed1f04c6d179f61f6fb2691261e84
# bad: [be8454afc50f43016ca8b6130d9673bdd0bd56ec] Merge tag
'drm-next-2019-07-16' of git://anongit.freedesktop.org/drm/drm
git bisect bad be8454afc50f43016ca8b6130d9673bdd0bd56ec
# bad: [be8454afc50f43016ca8b6130d9673bdd0bd56ec] Merge tag
'drm-next-2019-07-16' of git://anongit.freedesktop.org/drm/drm
git bisect bad be8454afc50f43016ca8b6130d9673bdd0bd56ec
# good: [d72619706abc4aa7e540ea882dae883cee7cc3b3] Merge tag 'tty-5.3-rc1'
of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect good d72619706abc4aa7e540ea882dae883cee7cc3b3
# bad: [83145f110eb2ada9d54fcbcf416c02de126381c1] drm/amdgpu: don't
invalidate caches in RELEASE_MEM, only do the writeback
git bisect bad 83145f110eb2ada9d54fcbcf416c02de126381c1
# bad: [b239c01727459ba08c44b79e6225d3c58723f282] drm/amdgpu: add mcbp
driver parameter
git bisect bad b239c01727459ba08c44b79e6225d3c58723f282
# good: [e1dc68a4b149d47536cd001d0d0abadbb62d37bd] drm: atmel-hlcdc: avoid
initializing cfg with zero
git bisect good e1dc68a4b149d47536cd001d0d0abadbb62d37bd
# bad: [c53e4db71276bf257b09010935a04bdafddd458e] drm/amdgpu: cancel
late_init_work before gpu reset
git bisect bad c53e4db71276bf257b09010935a04bdafddd458e
# good: [2da4605dce38b84cd2e5b86686f43adae1b2cacb] drm/amd/display: Use DCN
functions instead of DCE
git bisect good 2da4605dce38b84cd2e5b86686f43adae1b2cacb
# bad: [1c1e53f7f2ce191e6787d3d0648fe8ce7088ceaa] drm/amd/doc: Add XGMI
sysfs documentation
git bisect bad 1c1e53f7f2ce191e6787d3d0648fe8ce7088ceaa
# good: [89cd9d23e9a74d94f0db5bbbaf2ef1f6ede36ae5] drm/amdkfd: avoid HMM
change cause circular lock
git bisect good 89cd9d23e9a74d94f0db5bbbaf2ef1f6ede36ae5
# bad: [0803e7a9e850f9d6397c594d6c6deac9b2b6d696] drm/amdkfd: Allocate hiq
and sdma mqd from mqd trunk
git bisect bad 0803e7a9e850f9d6397c594d6c6deac9b2b6d696
# bad: [972fcdb52fe865a2f639e3200b97e648f34a0f41] drm/amdkfd: Introduce
asic-specific mqd_manager_init function
git bisect bad 972fcdb52fe865a2f639e3200b97e648f34a0f41
# bad: [6c55d6e90e68a4789cbd72a0287026d4dfb4a9f9] drm/amdkfd: support
concurrent userptr update for HMM
git bisect bad 6c55d6e90e68a4789cbd72a0287026d4dfb4a9f9
# bad: [ad595b8634f36f04bf69bef4eff854091d94f8b3] drm/amdgpu: fix HMM
config dependency issue
git bisect bad ad595b8634f36f04bf69bef4eff854091d94f8b3
# bad: