Hi,

Although I haven't had a chance yet to look through the current code, here
are some comments along the way.
On Thu, 2024-05-02 at 10:04 +0200, Daniel Vetter wrote:
> On Tue, Apr 30, 2024 at 09:09:15PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 30, 2024 at 08:57:48PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:
> > > > > > Yes there is another common scheme where you bind a window of
> > > > > > CPU to a window on the device and mirror a fixed range, but this
> > > > > > is a quite different thing. It is not SVA, it has a fixed range,
> > > > > > and it is probably bound to a single GPU VMA in a multi-VMA
> > > > > > device page table.
> > > > >
> > > > > And this above here is exactly what we're implementing, and the GPU
> > > > > page-tables are populated using device faults. Regions (large) of
> > > > > the mirrored CPU mm need to coexist in the same GPU vm as
> > > > > traditional GPU buffer objects.
> > > >
> > > > Well, not really, if that was the case you'd have a single VMA over
> > > > the entire bound range, not dynamically create them.
> > > >
> > > > A single VMA that uses hmm_range_fault() to populate the VM is
> > > > completely logical.
> > > >
> > > > Having a hidden range of mm binding and then creating/destroying 2M
> > > > VMAs dynamically is the thing that doesn't make a lot of sense.
> > >
> > > I only noticed this thread now but fyi I did dig around in the
> > > implementation and it's summarily an absolute no-go imo for multiple
> > > reasons. It starts with this approach of trying to mirror cpu vma
> > > (which I think originated from amdkfd) leading to all kinds of locking
> > > fun, and then it gets substantially worse when you dig into the
> > > details.

It's true that the cpu vma lookup is a remnant from amdkfd. The idea here is
to replace that with fixed prefaulting ranges of tunable size. So far, as you
mention, the prefaulting range has been determined by the CPU vma size. Given
previous feedback, this is going to change.

Still, the prefaulting range needs to be restricted to avoid -EFAULT failures
in hmm_range_fault(). That can of course be done by first calling it without
HMM_PFN_REQ_FAULT for the range and interpreting the returned pfns (a rough
sketch of that two-pass approach is appended at the end of this mail). There
is a performance concern with this approach compared to peeking at the CPU
vmas directly, since hmm_range_fault() would need to be called twice. Any
guidelines or ideas here?

The second aspect of this is the gpu_vma creation / splitting on fault that
the current implementation has. The plan is to get rid of that as well, in
favour of a sparsely populated gpu_vma. The reason for the current
implementation was the easy integration with drm_gpuvm. Still, I don't see
any locking problems here, maintaining the gpu_vm->lock -> mmap_lock ->
dma_resv -> reclaim -> notifier_lock ordering throughout the code (a skeleton
of the fault path illustrating that ordering is also appended below). What is
likely to get somewhat problematic, though, is VRAM eviction.

/Thomas

> [...]s the main reason I replied, it felt a bit like the thread was
> derailing in details that don't yet matter.
>
> Thanks, Sima
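Appending the sketches referred to above. First, a rough and untested sketch
of the two-pass hmm_range_fault() probe: the function and variable names are
made up for illustration only, and the mmu notifier retry loop and
mmap_read_lock() are assumed to be handled by the caller.

#include <linux/hmm.h>
#include <linux/mm.h>

/*
 * Sketch only: probe a candidate prefault window with a non-faulting
 * hmm_range_fault() call, trim it to the pfns that are already populated
 * around the faulting page, then fault in only the trimmed window so the
 * second call cannot hit -EFAULT on holes.
 */
static int probe_then_fault_prefault_range(struct hmm_range *range,
                                           unsigned long fault_addr)
{
        unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
        unsigned long fault_idx = (fault_addr - range->start) >> PAGE_SHIFT;
        unsigned long lo = fault_idx, hi = fault_idx;
        int err;

        /*
         * Pass 1: probe only. With no HMM_PFN_REQ_FAULT requested,
         * hmm_range_fault() reports holes instead of returning -EFAULT.
         */
        range->default_flags = 0;
        range->pfn_flags_mask = 0;
        err = hmm_range_fault(range);
        if (err)
                return err;

        /* Trim the window to the populated pfns around the faulting page. */
        while (lo > 0 && (range->hmm_pfns[lo - 1] & HMM_PFN_VALID))
                lo--;
        while (hi + 1 < npages && (range->hmm_pfns[hi + 1] & HMM_PFN_VALID))
                hi++;

        range->end = range->start + ((hi + 1) << PAGE_SHIFT);
        range->start += lo << PAGE_SHIFT;
        range->hmm_pfns += lo;

        /* Pass 2: fault in the trimmed window for real. */
        range->default_flags = HMM_PFN_REQ_FAULT;
        return hmm_range_fault(range);
}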
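Second, a hypothetical skeleton of the fault path showing the intended lock
ordering. The struct, field and function names are placeholders rather than
the actual Xe code, and "reclaim" only enters the ordering through
allocations done while dma_resv is held, so it shows up as a comment here.

#include <linux/dma-resv.h>
#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/rwsem.h>
#include <linux/spinlock.h>

struct gpu_vm {
        struct rw_semaphore lock;       /* outermost gpu_vm->lock */
        struct mm_struct *mm;           /* mirrored CPU mm */
        struct dma_resv *resv;          /* vm-wide reservation object */
        spinlock_t notifier_lock;       /* innermost, vs. invalidation */
};

static int gpu_vm_fault(struct gpu_vm *vm, struct hmm_range *range)
{
        int err;

retry:
        down_read(&vm->lock);

        mmap_read_lock(vm->mm);
        range->notifier_seq = mmu_interval_read_begin(range->notifier);
        range->default_flags = HMM_PFN_REQ_FAULT;
        err = hmm_range_fault(range);
        mmap_read_unlock(vm->mm);
        if (err) {
                up_read(&vm->lock);
                if (err == -EBUSY)
                        goto retry;     /* collided with an invalidation */
                return err;
        }

        err = dma_resv_lock(vm->resv, NULL);
        if (err) {
                up_read(&vm->lock);
                return err;
        }
        /*
         * Page-table BO allocation / validation would go here; anything
         * that may recurse into reclaim must happen before taking the
         * notifier_lock below.
         */

        spin_lock(&vm->notifier_lock);
        if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
                /* Raced with an invalidation, drop everything and retry. */
                spin_unlock(&vm->notifier_lock);
                dma_resv_unlock(vm->resv);
                up_read(&vm->lock);
                goto retry;
        }
        /* Commit the GPU PTEs pointing at range->hmm_pfns here. */
        spin_unlock(&vm->notifier_lock);

        dma_resv_unlock(vm->resv);
        up_read(&vm->lock);
        return 0;
}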