Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Christian König Thu, 20 Apr 2017 01:44:10 -0700

Hi AlexBin,

Missing kunmap mapping in vmalloc will make kernel master page table incorrect.

That's what I tried to explain yesterday, but unfortunately didn't have time to do so. There is no corruption of the kernel master page table in this case!

The call of ttm_bo_kunmap is completely optional, take a look at amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free().

The aperture is kept mapped into the page tables for the whole time the driver is loaded. So this is a complete no-op and only done for consistency.

It is good that you agree that there is no real world bad example caused by my patch. I will not discuss whether it is an improvement or not now to save time for both of us.

Great, at least we can now agree to completely drop this patch.

Thanks,
Christian.

On 19.04.2017 at 21:30, Xie, AlexBin wrote:

Hi Christian,
Missing kunmap mapping in vmalloc will make kernel master page table incorrect. I would not call such an issue completely harmless. Please note that the AMD graphics driver can run on a 32 bit system. On a 32 bit system, the vmalloc address space is much smaller than the size of most GPU VRAM.
amdgpu_bo_free_kernel has the same issue as amdgpu_vram_scratch_fini. 1. It calls amdgpu_bo_reserve interruptibly too. 2. It misses kunmap when amdgpu_bo_reserve returns an error too. As a result, the kernel master page table can become incorrect, or as you call it "completely harmless vmalloc space leaking".
Because amdgpu_bo_free_kernel is used in more places, such as psp command submission, there will be a bigger chance of other usage where signals are not blocked. This will become a real bug.
I am thinking that we may fix the issue completely when TTM releases the BO. But that is a bigger change.
It is good that you agree that there is no real world bad example caused by my patch. I will not discuss whether it is an improvement or not now to save time for both of us.
Thanks,

Alex Bin Xie


------------------------------------------------------------------------
*From:* Christian König <deathsim...@vodafone.de>
*Sent:* Wednesday, April 19, 2017 7:50 AM
*To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
*Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
Without correctly kunmap, page table is corrupted. Page entries point to wrong memory locations. You might call it completely harmless. But I think this is a severe bug. Leaking memory is better than a corrupted page table. Think security.
We are talking about the page tables for the vmalloc area in the kernel here, so no security problem. Leaking memory is much more problematic.
Would you provide any document or reference for saying "It is impossible to receive a signal during module load/unload"? For example, if the unload is stuck on a lock, can CTRL+C stop the unload?
No, CTRL+C doesn't abort module load/unload. There have been patches to change this a while ago, but IIRC it broke a whole bunch of drivers relying on this.
What if there is some other return error? What if in future somebody improves amdgpu_bo_reserve to return other errors, and then function amdgpu_vram_scratch_fini becomes buggy?
Yes, that is indeed an issue. For example -EDEADLK is possible as well. That's why I said we should use amdgpu_bo_free_kernel() instead.
While I am thinking whether there is a better way for the current situation, would you give a real world example where my patch is really not working? Then we can address it.
I don't think there is because the driver can't receive a signal during load/unload, but the problem is rather that the patch doesn't improve the situation at all.
Regards,
Christian.

On 19.04.2017 at 13:37, Xie, AlexBin wrote:
Hi Christian,
Without correctly kunmap, page table is corrupted. Page entries point to wrong memory locations. You might call it completely harmless. But I think this is a severe bug. Leaking memory is better than a corrupted page table. Think security.
Would you provide any document or reference for saying "It is impossible to receive a signal during module load/unload"? For example, if the unload is stuck on a lock, can CTRL+C stop the unload?
If "It is impossible to receive a signal during module load/unload", interruptible waiting is fine too, because function amdgpu_bo_reserve will return successfully.
What if there is some other return error? What if in future somebody improves amdgpu_bo_reserve to return other errors, and then function amdgpu_vram_scratch_fini becomes buggy?
While I am thinking whether there is a better way for the current situation, would you give a real world example where my patch is really not working? Then we can address it.
Thanks,

Alex Bin


------------------------------------------------------------------------
*From:* Christian König <deathsim...@vodafone.de>
*Sent:* Wednesday, April 19, 2017 2:35 AM
*To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
*Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
Hi AlexBin,
the answer is ttm_bo_kunmap isn't called at all and yes, in the case of an iomap we leak the address space reserved.
But that is completely harmless on a 64bit system compared to leaking the memory backing the address space.
Using amdgpu_bo_free_kernel() instead of open coding it here is probably a good idea.
In addition to that it's probably a good idea to set the no_intr flag when reserving kernel BOs. It is impossible to receive a signal during module load/unload, but it's probably better to document that in the code as well.
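
To make the teardown options argued over in this thread concrete, here is a small userspace model, not driver code. It assumes reserve can fail only when the wait is interruptible and a signal is pending (other error codes are not modeled), and every model_* and fini_* name is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Kernel-internal errno value; userspace <errno.h> does not export it. */
#define MODEL_ERESTARTSYS 512

/* Hypothetical stand-in for the scratch buffer object's state. */
struct model_bo {
	int refcount;  /* references still held on the BO */
	bool pinned;   /* true until the BO is unpinned */
	bool kmapped;  /* true until the kernel mapping is dropped */
};

/* Assumption: reserve fails only for an interruptible wait with a
 * pending signal. */
static int model_reserve(bool no_intr, bool signal_pending)
{
	return (!no_intr && signal_pending) ? -MODEL_ERESTARTSYS : 0;
}

/* The kunmap + unpin steps that need the reservation. */
static void model_cleanup(struct model_bo *bo)
{
	bo->kmapped = false;
	bo->pinned = false;
}

/* Original shape: drop the reference even if reserve failed. */
static void fini_original(struct model_bo *bo, bool signal_pending)
{
	if (model_reserve(false, signal_pending) == 0)
		model_cleanup(bo);
	bo->refcount--;
}

/* Patched shape: drop the reference only if reserve succeeded. */
static void fini_patched(struct model_bo *bo, bool signal_pending)
{
	if (model_reserve(false, signal_pending) == 0) {
		model_cleanup(bo);
		bo->refcount--;
	}
}

/* Suggested shape: reserve with no_intr set, then always drop the ref. */
static void fini_no_intr(struct model_bo *bo, bool signal_pending)
{
	if (model_reserve(true, signal_pending) == 0)
		model_cleanup(bo);
	bo->refcount--;
}
```

With a pending signal, the original shape drops the reference while the model still shows the BO pinned and mapped, the patched shape keeps the reference (the memory leak), and the no_intr shape both cleans up and drops the reference.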
Regards,
Christian.

On 18.04.2017 at 20:54, Xie, AlexBin wrote:
Hi Christian,
Have you found how/where/when? When you said "mapping will just be released a bit later on", you must know the answer.
It is difficult to prove something does not exist. Anyway, I will give it a try to prove such "later on" does not exist.
Function ttm_bo_kunmap is the only function to unmap. To prove this, search ttm_bo_map_iomap, only ttm_bo_kunmap uses this enum to correctly kunmap.
Function ttm_bo_kunmap is not called by ttm itself. This is a hint that all TTM delay delete mechanism or unref mechanism will NOT kunmap BO later on.
Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.
Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer is still NO.
So all possibilities are searched. Did I miss anything?

Thanks,
Alex Bin Xie

------------------------------------------------------------------------
*From:* Xie, AlexBin
*Sent:* Tuesday, April 18, 2017 2:04:33 PM
*To:* Christian König; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
*Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi Christian,
Would you point out where/when kunmap will happen for this BO when released? It must be somewhere in some function calls.
I checked before I asked for review. But I did not see such an obvious kunmap function call.
If so, there should be a comment in function amdgpu_vram_scratch_fini to avoid future confusion.
Thanks,
Alex Bin Xie
------------------------------------------------------------------------
*From:* Christian König <deathsim...@vodafone.de>
*Sent:* Tuesday, April 18, 2017 1:46 PM
*To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
*Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
Hi AlexBin,
No, David is right. This is a very common coding pattern in the kernel module.
Freeing up a BO while there still exists a kernel mapping is perfectly ok, the mapping will just be released a bit later on.
So this code is actually perfectly ok and just an optimization, but your patch breaks it and creates a memory leak.
Regards,
Christian.

On 18.04.2017 at 17:17, Xie, AlexBin wrote:
Hi David,
When amdgpu_bo_reserve returns an error, we cannot release the BO. This is not a memory leak. Generally speaking, a memory leak is unnoticed and unintentional.
The caller of function amdgpu_vram_scratch_fini ignores the return error value...
The "memory leak" is not caused by my patch. It is caused because reserving the BO fails.
This patch only aims to make function amdgpu_vram_scratch_fini behave correctly.
To follow up, we can add a warning message when an amdgpu_bo_reserve error happens in a different patch.
If function call amdgpu_bo_reserve is changed to uninterruptible, this changes driver behaviour. Without a substantial issue, I would be cautious about such a change.
Thanks,

Alex Bin Xie


------------------------------------------------------------------------
*From:* Zhou, David(ChunMing)
*Sent:* Monday, April 17, 2017 10:38 PM
*To:* Xie, AlexBin; amd-gfx@lists.freedesktop.org
*Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO


On 2017-04-17 22:54, Xie, AlexBin wrote:
Hi David,
Thanks for the comments. However, please have a look at the amdgpu_bo_reserve definition.
static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)
Ah, this is a weird wrapper for ttm_bo_reserve.
When we call this function like the following:

     r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.
On the other hand, when the amdgpu_bo_reserve function returns an error, why do we unref the BO without kunmap and unpin of the BO? This is a wrong implementation when amdgpu_bo_reserve returns any error.
Yeah, I see what you mean, it's in driver unloading. How about changing to non-interruptible? Your patch will cause a memleak if bo_reserve fails, but it seems not to matter. I have no strong preference.
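
A minimal userspace sketch of the flag inversion discussed here, assuming (as the exchange implies) that the wrapper hands the negation of no_intr to the underlying reserve call as its interruptible argument. The model_* names and the signal_pending parameter are hypothetical stand-ins for illustration, not the driver's or TTM's actual code, and lock contention is not modeled:

```c
#include <assert.h>
#include <stdbool.h>

/* ERESTARTSYS is kernel-internal and not exported by userspace <errno.h>. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Hypothetical stand-in for the underlying reserve call: an interruptible
 * wait gives up with -ERESTARTSYS when a signal is pending, while an
 * uninterruptible wait ignores the signal. */
static int model_ttm_reserve(bool interruptible, bool signal_pending)
{
	if (interruptible && signal_pending)
		return -ERESTARTSYS;
	return 0;
}

/* Model of the wrapper: the caller's flag means "no interruption", so it
 * is inverted before being passed down. */
static int model_bo_reserve(bool no_intr, bool signal_pending)
{
	return model_ttm_reserve(!no_intr, signal_pending);
}
```

Under this model, passing false for no_intr leaves the wait interruptible, so a pending signal yields -ERESTARTSYS; passing true makes the same call succeed.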
Regards,
David Zhou
Thanks,
Alex Bin Xie

------------------------------------------------------------------------
*From:* Zhou, David(ChunMing)
*Sent:* Friday, April 14, 2017 12:00 AM
*To:* Xie, AlexBin; amd-gfx@lists.freedesktop.org
*Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO


On 2017-04-14 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <alexbin....@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +              amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -      amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx