On 14.05.24 17:14, Tvrtko Ursulin wrote:

On 13/05/2024 14:49, Tvrtko Ursulin wrote:

On 09/05/2024 13:40, Tvrtko Ursulin wrote:

On 08/05/2024 19:09, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursu...@igalia.com>

Over the last few days I was looking at the situation with VRAM over-subscription, what happens versus what perhaps should happen, browsing through the driver and
running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be completely wrong. But since I found some things that look suspicious, to me at least, I thought it
was a good point to stop and request some comments.

To summarise the main issues I think I found:

  * Migration rate limiting does not check whether a migration actually
    happened, and so can over-account and unfairly penalise.

  * Migration rate limiting does not even work, at least not for the common case
    where userspace configures VRAM+GTT. It thinks it can stop migration attempts
    by playing with bo->allowed_domains vs bo->preferred_domains but, both from
    the code and from empirical experiments, I see that not working at all. Both
    masks are identical so fiddling with them achieves nothing.

  * The idea of the fallback placement only works while VRAM has free space. As
    soon as it does not, ttm_resource_compatible is happy to leave the buffers
    in the secondary placement forever.

  * The driver thinks it will be re-validating evicted buffers on the next
    submission, but it does not for the very common case of VRAM+GTT, because
    it only checks whether the current placement is *none* of the preferred
    placements.

All those problems are addressed in individual patches.

The end result of this series appears to be a driver which will try harder to move buffers back into VRAM, but will be (more) correctly throttled in doing so by
the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change, but that could be a good thing too; at least I did not break anything, perhaps. On one occasion I did see the rate limiting logic get confused: for a period of a few minutes it went into a mode where it was constantly giving out a high migration budget. But that recovered by itself when I switched clients and did not come back, so I don't know. If there is something wrong there, I don't think
it would be caused by any patches in this series.

Since yesterday I also briefly tested with Far Cry New Dawn. One run each, so it possibly doesn't mean anything, apart from showing that there isn't a regression, i.e. migration throttling is keeping things at bay even with the increased requests to migrate things back to VRAM:

                        before          after
min/avg/max fps         36/44/54        37/45/55

Cyberpunk 2077 from yesterday was similarly close:

min/avg/max fps         26.96/29.59/30.40    29.70/30.00/30.32

I guess the real story is on a proper dGPU, where misplaced buffers have a real cost.

I found one game which regresses spectacularly badly with this series: Assassin's Creed Valhalla, or at least its built-in benchmark. The game appears to have a working set much larger than the other games I tested, around 5 GiB total during the benchmark, and for some reason migration throttling totally fails to keep it in check. I will be investigating this shortly.

I think the conclusion is that everything I attempted to add relating to TTM_PL_PREFERRED does not really work as I initially thought it did. Therefore please imagine this series as containing only patches 1, 2 and 5.

Noted (and I had just started to wrap my head around that idea).


(And FWIW it was quite annoying to get to the bottom of, since for some reason the system exhibits some sort of latching behaviour: on some boots and/or after some minutes of runtime things were fine, and then it would latch onto a mode where the TTM_PL_PREFERRED-induced breakage would show. And sometimes the breakage would appear straight away. Odd.)

Welcome to my world. You improve one use case and four others get a penalty. Even when you know the code and the potential use cases inside out, it's really hard to predict how some applications and the core memory management will sometimes behave.


I still need to test, though, whether the subset of patches manages to achieve some positive improvement on its own. It is possible, as patch 5 marks more buffers for re-validation, so once overcommit subsides they would get promoted to the preferred placement straight away. And 1&2 are notionally fixes for migration throttling, so at least in a broad sense they should still be valid as discussion points.

Yeah, especially 5 kind of makes sense, but it could potentially lead to higher overhead. Need to see how we can handle that better.

Regards,
Christian.


Regards,

Tvrtko

The series is probably rough, but it should be good enough for discussion. I am curious to hear if I identified at least something correctly as a real problem.

It would also be good to hear which games are suggested for checking whether
there is any improvement.

Cc: Christian König <christian.koe...@amd.com>
Cc: Friedrich Vock <friedrich.v...@gmx.de>

Tvrtko Ursulin (5):
   drm/amdgpu: Fix migration rate limiting accounting
   drm/amdgpu: Actually respect buffer migration budget
   drm/ttm: Add preferred placement flag
   drm/amdgpu: Use preferred placement for VRAM+GTT
   drm/amdgpu: Re-validate evicted buffers

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 38 +++++++++++++++++-----
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 21 ++++++++++--
  drivers/gpu/drm/ttm/ttm_resource.c         | 13 +++++---
  include/drm/ttm/ttm_placement.h            |  3 ++
  5 files changed, 65 insertions(+), 18 deletions(-)

