On 2/22/23 15:20, Christian König wrote:
On 2/22/23 14:54, Thomas Hellström wrote:
Hi,
On 2/22/23 12:39, Christian König wrote:
Hi Thomas,
On 2/22/23 12:00, Thomas Hellström wrote:
Hi, Christian,
So I resurrected Maarten's previous patch series around this (the
amdgpu suballocator) slightly modified the code to match the API of
this patch series, re-introduced the per-allocation alignment as
per a previous review comment from you on that series, and made
checkpatch.pl pass mostly, except for pre-existing style problems,
and added / fixed some comments. No memory corruption seen so far
on limited Xe testing.
To move this forward I suggest starting with that as a common drm
suballocator. I'll post the series later today. We can follow up
with potential simplifications if needed.
I also made a kunit test, which also reports some timing information.
Will post that as a follow up. Some interesting preliminary
conclusions:
* drm_mm is per se not a CPU hog. If the rb-tree processing is
disabled and the EVICT algorithm is changed from MRU to ring-like
LRU traversal, it's more or less just as fast as the ring
suballocator.
* With a single ring, and the suballocation buffer never completely
filled (no sleeps), the amd suballocator is a bit faster per
allocation / free (around 250 ns instead of 350 ns). Allocation is
slightly slower on the amdgpu one while freeing is faster, mostly due
to the locking overhead incurred when setting up the fence callbacks,
and due to avoiding irq-disabled processing, in the one I proposed.
For some more realistic numbers, try to signal the fence from another
CPU. Alternatively you can invalidate all the CPU read cache lines
touched by the fence callback so that they need to be read in again
from the allocating CPU.
Fences are signalled using hr-timer driven fake "ring"s, so signalling
should probably be distributed among CPUs in a pretty realistic way.
But anyway, I agree that results obtained from that kunit test can and
should be challenged before we actually use them for improvements.
I would double check that. My expectation is that hr-timers execute by
default on the CPU from which they are started.
Hmm, since I'm not using the _PINNED hrtimer flag I'd expect them to be
more distributed, but you're right, they weren't; only rather few
timer expiries came from other CPUs. So the figures for signalling on
other CPUs are around 500 ns for the amdgpu variant and around 900 ns
for the fence-callback one. Still, sleeping starts at around 50-75%
fill with the amdgpu variant.
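
For reference, the fake "ring" signalling in the test is roughly along
these lines (a simplified sketch with made-up names, not the actual
test code). As you say, without one of the HRTIMER_MODE_*_PINNED modes,
or explicitly arming the timer from another CPU, the expiry callback
will typically run on the CPU that started the timer:

#include <linux/dma-fence.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct fake_ring_fence {
	struct dma_fence base;   /* set up elsewhere with dma_fence_init() */
	spinlock_t lock;         /* lock passed to dma_fence_init() */
	struct hrtimer timer;
};

static enum hrtimer_restart fake_ring_timer_fn(struct hrtimer *timer)
{
	struct fake_ring_fence *f =
		container_of(timer, struct fake_ring_fence, timer);

	/* Typically runs on the CPU that armed the timer unless pinned or migrated. */
	dma_fence_signal(&f->base);
	dma_fence_put(&f->base); /* drop the reference held for the timer */
	return HRTIMER_NORESTART;
}

static void fake_ring_submit(struct fake_ring_fence *f, u64 delay_ns)
{
	hrtimer_init(&f->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	f->timer.function = fake_ring_timer_fn;
	hrtimer_start(&f->timer, ns_to_ktime(delay_ns), HRTIMER_MODE_REL);
}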
* With multiple rings and varying allocation sizes and signalling
times creating fragmentation, the picture becomes different, as the
amdgpu allocator starts to sleep/throttle already at around 50%-75%
fill. The one I proposed does so between 75% and 90% fill, and once
that happens, the CPU cost of putting to sleep and waking up should
really overshadow the above numbers.
So it's really a tradeoff, where IMO code size and maintainability
should also play a role.
Also, I looked at the history of the amdgpu allocator, which originates
back in Radeon around 2012, but couldn't find any commits mentioning
fence callbacks nor problems with those. Could you point me to that
discussion?
Uff that was ~10 years ago. I don't think I can find that again.
OK, fair enough. But what was the objective reasoning against using
fence callbacks for this sort of stuff? Was it unforeseen locking
problems, caching issues or something else?
Well, cache line bouncing is one major problem. Also take a look at
the discussion about using list_head in interrupt handlers; that
should be easy to find on LWN.
The allocator usually manages enough memory so that it never runs into
waiting for anything; only in extreme cases like GPU resets do we
actually wait for allocations to be freed.
I guess this varies with the application, but it can be remedied by
just adding more managed memory if needed.
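
E.g. something like the below at init time. This is just an
illustrative sketch assuming a drm_suballoc-style manager along the
lines of the posted series; the header and function names may not
match the final API exactly:

#include <drm/drm_suballoc.h>   /* assumed header from the posted series */
#include <linux/sizes.h>

/* Generous managed size so that allocations rarely, if ever, have to wait. */
#define MY_SA_SIZE   SZ_512K
#define MY_SA_ALIGN  64

static void my_sa_init(struct drm_suballoc_manager *sa_manager)
{
	drm_suballoc_manager_init(sa_manager, MY_SA_SIZE, MY_SA_ALIGN);
}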
/Thomas
So the only cache line which is accessed from more than one CPU
should be the one holding the signaled flag of the fence.
By moving list work into the interrupt handler you have at least three
cache lines which start to bounce between different CPUs.
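
To illustrate the contrast being described, a minimal sketch with
hypothetical names (not code from either series): the first helper
only reads the fence's flags word, while the callback variant dirties
the allocator lock and the neighbouring list_heads on whatever CPU
signals the fence:

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct sa_entry {                     /* hypothetical suballocation entry */
	struct list_head node;
	struct dma_fence *fence;
	struct dma_fence_cb cb;
};

static DEFINE_SPINLOCK(sa_lock);      /* hypothetical allocator lock */
static LIST_HEAD(sa_free_list);

/* Polling check: only the cache line holding fence->flags is shared. */
static bool sa_entry_is_idle(struct sa_entry *e)
{
	return dma_fence_is_signaled(e->fence);
}

/*
 * Callback-driven variant: runs on the signalling CPU and writes to the
 * lock, to e->node and to the neighbouring list_heads, so those cache
 * lines bounce back to the allocating CPU. It also forces irq-disabled
 * locking, since the callback may fire from interrupt context.
 */
static void sa_entry_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct sa_entry *e = container_of(cb, struct sa_entry, cb);
	unsigned long flags;

	spin_lock_irqsave(&sa_lock, flags);
	list_move_tail(&e->node, &sa_free_list);
	spin_unlock_irqrestore(&sa_lock, flags);
}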
Regards,
Christian.
Thanks,
Thomas
Regards,
Christian.
Thanks,
Thomas
On 2/17/23 14:51, Thomas Hellström wrote:
On 2/17/23 14:18, Christian König wrote:
On 2/17/23 14:10, Thomas Hellström wrote:
[SNIP]
Any chance you could do a quick performance comparison? If
not, anything against merging this without the amd / radeon
changes until we can land a simpler allocator?
Only if you can stick the allocator inside Xe and not drm,
because this seems to be for a different use case than the
allocators inside radeon/amdgpu.
Hmm. No, it's allocating in a ring-like fashion as well. Let me
put together a unit test for benchmarking. I think it would be
a failure for the community to end up with three separate
suballocators doing the exact same thing for the same problem,
really.
Well, that's exactly the point. Those allocators aren't the same
because they handle different problems.
The allocator in radeon is simpler because it only had to deal
with a limited number of fence timelines. The one in amdgpu is
a bit more complex because it has to handle more fence
timelines.
We could take the one from amdgpu and use it for radeon and
others as well, but the allocator proposed here doesn't even
remotely match the requirements.
But again, what *are* those missing requirements exactly? What
is the pathological case you see for the current code?
Well, very low CPU overhead and not doing anything in a callback.
Well, dma_fence_wait_any() will IIRC register callbacks on all
affected fences, although admittedly there is no actual allocator
processing in them.
From what I can tell, the amdgpu suballocator introduces
excessive complexity to coalesce waits for fences from the same
contexts, whereas the present code just frees from the fence
callback if the fence wasn't already signaled.
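
For reference, that free path boils down to roughly the following
(simplified sketch, hypothetical names): dma_fence_add_callback()
returns -ENOENT if the fence is already signaled, in which case the
block is reclaimed immediately; otherwise reclaim happens from the
callback.

#include <linux/dma-fence.h>

struct suballoc {                     /* hypothetical suballocation */
	struct dma_fence_cb cb;
	/* offset / size bookkeeping omitted */
};

static void suballoc_reclaim(struct suballoc *sa)
{
	/* Return the range to the pool; details and locking omitted. */
}

static void suballoc_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	suballoc_reclaim(container_of(cb, struct suballoc, cb));
}

static void suballoc_free(struct suballoc *sa, struct dma_fence *fence)
{
	/* Nonzero (-ENOENT) means the fence already signaled: reclaim now. */
	if (dma_fence_add_callback(fence, &sa->cb, suballoc_fence_cb))
		suballoc_reclaim(sa);
}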
And this is exactly the design we had previously which we removed
after Dave stumbled over tons of problems with it.
So is the worry that those problems have spilled over into this code,
then? It's been pretty extensively tested. Or is it that you should
never really use dma-fence callbacks?
The fence signalling code that fires that callback is typically
run anyway on scheduler fences.
The reason we had for not using the amdgpu suballocator as
originally planned was that this complexity made it very hard
for us to understand it and to fix issues we had with it.
Well, what are those problems? The idea is actually not that
hard to understand.
We hit memory corruption, and we spent substantially more time
trying to debug it than to put together this patch, while never
really understanding what happened, nor why you don't see that
with amdgpu.
We could simplify it massively for the cost of only waiting for
the oldest fence if that helps.
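
Roughly along these lines (a sketch with hypothetical names;
per-context handling and locking omitted): keep a FIFO of retired
ranges, block only on the fence at its head, and then reclaim every
already-signaled entry.

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/slab.h>

struct pending_free {                 /* hypothetical: one retired range */
	struct list_head node;
	struct dma_fence *fence;
	u64 offset, size;
};

static LIST_HEAD(pending_fifo);       /* oldest entry first */

/* Called when an allocation doesn't fit: wait only for the oldest fence. */
static int reclaim_oldest(void)
{
	struct pending_free *p, *tmp;
	long ret;

	p = list_first_entry_or_null(&pending_fifo, struct pending_free, node);
	if (!p)
		return -ENOSPC;

	ret = dma_fence_wait(p->fence, true);
	if (ret)
		return ret;

	/* Reclaim the head plus anything else that has signaled meanwhile. */
	list_for_each_entry_safe(p, tmp, &pending_fifo, node) {
		if (!dma_fence_is_signaled(p->fence))
			break;
		list_del(&p->node);
		/* return [p->offset, p->offset + p->size) to the ring here */
		dma_fence_put(p->fence);
		kfree(p);
	}
	return 0;
}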
Let me grab the latest version from amdgpu and give it a try
again, but yes I think that to make it common code we'll need it
simpler (and my personal wish would be to separate the allocator
functionality a bit more from the fence waiting, which I guess
should be OK if the fence waiting is vastly simplified).
/Thomas
Regards,
Christian.
Regards,
Thomas