On 1/28/2026 7:04 AM, Danilo Krummrich wrote:
On Fri Jan 23, 2026 at 12:16 AM CET, Joel Fernandes wrote:
My plan is to make TLB and PRAMIN use immutable references in their function
calls and then implement internal locking. I've already done this for the GPU
buddy functions, so it should be doable, and we'll keep it consistent. As a
result, we will have finer-grained locking on the memory management objects
instead of requiring a global lock on a common GpuMm object. I plan on
doing this for v7.
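
For illustration, the rough shape I have in mind is something like this (a
standalone sketch with made-up names; Pramin, window_base and write32 are not
the actual nova-core API, and std::sync::Mutex just stands in for the kernel
mutex):

  use std::sync::Mutex;

  // Illustrative only: the window state is protected by an internal lock,
  // so callers only ever need an immutable reference.
  struct PraminInner {
      window_base: u64,
  }

  struct Pramin {
      inner: Mutex<PraminInner>,
  }

  impl Pramin {
      fn new() -> Self {
          Self { inner: Mutex::new(PraminInner { window_base: 0 }) }
      }

      // Callers pass `&self`; serialization happens inside.
      fn write32(&self, addr: u64, _value: u32) {
          let mut inner = self.inner.lock().unwrap();
          // Reposition the 64K window to cover `addr` (illustrative only).
          inner.window_base = addr & !0xffff;
          // ... program the window and issue the actual write here ...
      }
  }

  fn main() {
      let pramin = Pramin::new();
      // Two call sites share `&Pramin` without an external GpuMm-wide lock.
      pramin.write32(0x1000, 0xdead_beef);
      pramin.write32(0x2000, 0xcafe_f00d);
  }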

Also, the PTE allocation race you mentioned is already handled by PRAMIN
serialization. Since threads must hold the PRAMIN lock to write page table
entries, concurrent writers are not possible:

   Thread A: acquire PRAMIN lock
   Thread A: read PDE (via PRAMIN) -> NULL
   Thread A: alloc PT page, write PDE
   Thread A: release PRAMIN lock

   Thread B: acquire PRAMIN lock
   Thread B: read PDE (via PRAMIN) -> sees A's pointer
   Thread B: uses existing PT page, no allocation needed
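
In code, the pattern traced above looks roughly like this (again a made-up
sketch, with std::sync::Mutex standing in for the PRAMIN lock; these are not
the actual nova-core types):

  use std::sync::Mutex;

  // Illustrative stand-ins: a single PDE slot, for brevity.
  struct PageTables {
      pde: Option<Box<[u64; 512]>>,
  }

  // Take the PRAMIN lock, check the PDE, and if it is missing, allocate the
  // page table while still holding the lock.
  fn get_or_alloc_pt(pramin_lock: &Mutex<PageTables>) {
      let mut tables = pramin_lock.lock().unwrap(); // acquire PRAMIN lock
      if tables.pde.is_none() {
          // Thread A's case: PDE is NULL, allocate and publish it under the lock.
          tables.pde = Some(Box::new([0u64; 512]));
      }
      // Thread B's case: the PDE is already populated and is simply reused.
      // ... write PTEs via PRAMIN here ...
  } // PRAMIN lock released

  fn main() {
      let pramin_lock = Mutex::new(PageTables { pde: None });
      get_or_alloc_pt(&pramin_lock); // Thread A
      get_or_alloc_pt(&pramin_lock); // Thread B sees A's page table
  }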

This won't work unfortunately.

We have to separate allocations from modifications of the page table. In other
words, we must not allocate new PDEs or PTEs while holding the lock that
protects the page table from modifications.
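
For illustration, the separation would look roughly like this (same made-up
stand-ins as above, not the actual nova-core code; the real pre-allocation
would happen in the job's submit stage described further down):

  use std::sync::Mutex;

  struct PageTables {
      pde: Option<Box<[u64; 512]>>,
  }

  // Step 1: allocate everything that *might* be needed without holding the
  // lock that protects the page tables (this part may sleep, fail, retry, ...).
  fn prealloc_pt() -> Box<[u64; 512]> {
      Box::new([0u64; 512])
  }

  // Step 2: with the allocation already in hand, take the lock only to inspect
  // and modify the page tables; nothing is allocated under the lock.
  fn install_pt(pramin_lock: &Mutex<PageTables>, spare: Box<[u64; 512]>) {
      let mut tables = pramin_lock.lock().unwrap();
      if tables.pde.is_none() {
          tables.pde = Some(spare); // consume the pre-allocated page table
      }
      // else: `spare` is dropped (or returned to a pool); someone else won the race
  }

  fn main() {
      let pramin_lock = Mutex::new(PageTables { pde: None });
      let spare = prealloc_pt();       // outside the lock
      install_pt(&pramin_lock, spare); // under the lock, no allocation
  }

The point being that the lock is only ever held across the page table
inspection and update itself, never across an allocation.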

I will go over these concerns. Just to clarify: do you mean forbidding *any* lock, or only non-atomic locks? I believe we can avoid non-atomic locks completely - in fact, I just wrote a patch to do exactly that before reading this email. If we have to forbid any locking at all, that might require some careful redesign to handle the above race, afaics.


Once we have VM_BIND in nova-drm, we will have the situation that userspace
passes jobs to modify the GPU's virtual address space and hence the page tables.

Thanks for listing all the concerns below, this is very valuable. I will go over all these and all cases before posting the v7 now that I have this.

--
Joel Fernandes


Such a job has mainly three stages (a rough sketch of the full flow follows the list).

   (1) The submit stage.

       This is where the job is initialized, dependencies are set up and the
       driver has to pre-allocate all kinds of structures that are required
       throughout the subsequent stages of the job.

   (2) The run stage.

       This is the stage where the job is staged for execution and its DMA fence
       has been made public (i.e. it is accessible by userspace).

       This is the stage where we are in the DMA fence signalling critical
       section, hence we can't do any non-atomic allocations, since otherwise we
       could deadlock in MMU notifier callbacks for instance.

       This is the stage where the page table is actually modified. Hence, we
       can't acquire any locks that might be held elsewhere while doing
       non-atomic allocations. Also note that this is transitive: e.g. if
       somewhere else a lock B is taken while A is already held, and we do
       non-atomic allocations while holding B, then A can't be held in the
       DMA fence signalling critical path either.

       It is also worth noting that this is the stage where we know the exact
       operations we have to execute based on the VM_BIND request from
       userspace.

       For instance, in the submit stage we may only know that userspace wants
       us to map a BO with a certain offset into the GPU's virtual address
       space at [0x0, 0x1000000]. What we don't know is which exact operations
       this requires, i.e. "What do we have to unmap first?", "Are there any
       overlapping mappings that we have to truncate?", etc.

       So, we have to consider this when we pre-allocate in the submit stage.

   (3) The cleanup stage.

       This is where the job has been signalled and hence has left the DMA
       fence signalling critical section.

       In this stage the job is cleaned up, which includes freeing data that is
       not required anymore, such as PTEs and PDEs.
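
Putting the three stages together, a rough standalone sketch (all names are
made up, not the actual nova-core or DRM scheduler API; the worst-case
arithmetic assumes 4 KiB pages and 512-entry page tables purely for
illustration, and std::sync::Mutex stands in for whatever actually protects
the page tables):

  use std::sync::Mutex;

  const PAGE_SHIFT: u64 = 12;         // assuming 4 KiB pages
  const ENTRIES_PER_TABLE: u64 = 512; // assuming 512 entries per table level

  struct PageTables; // placeholder for the real page table structures

  struct BindJob {
      va_start: u64,
      va_end: u64,
      // Pre-allocated in the submit stage, consumed (or freed) later.
      spare_tables: Vec<Box<[u64; 512]>>,
  }

  // Worst-case number of last-level page tables covering [va_start, va_end).
  // Purely illustrative arithmetic; a real layout has more levels to account for.
  fn worst_case_tables(va_start: u64, va_end: u64) -> u64 {
      let pages = (va_end - va_start) >> PAGE_SHIFT;
      pages.div_ceil(ENTRIES_PER_TABLE) + 1 // +1 for a possibly unaligned start
  }

  // (1) Submit stage: may sleep and allocate. We don't yet know which exact
  //     unmaps/remaps the request implies, so we allocate for the worst case.
  fn submit(va_start: u64, va_end: u64) -> BindJob {
      let n = worst_case_tables(va_start, va_end);
      let spare_tables = (0..n).map(|_| Box::new([0u64; 512])).collect();
      BindJob { va_start, va_end, spare_tables }
  }

  // (2) Run stage: inside the DMA fence signalling critical section. Only now
  //     do we compute the exact operations, and we must not allocate here --
  //     we only consume what submit() set aside.
  fn run(job: &mut BindJob, page_tables: &Mutex<PageTables>) {
      let _guard = page_tables.lock().unwrap();
      // ... walk [job.va_start, job.va_end), unmap/truncate overlapping
      //     mappings, and install entries using job.spare_tables.pop() ...
  }

  // (3) Cleanup stage: the fence has signalled; free unused pre-allocations
  //     and any page tables that became empty.
  fn cleanup(job: BindJob) {
      drop(job.spare_tables);
  }

  fn main() {
      let page_tables = Mutex::new(PageTables);
      // e.g. map a BO at [0x0, 0x1000000): 16 MiB = 4096 pages -> 9 tables worst case
      let mut job = submit(0x0, 0x100_0000);
      run(&mut job, &page_tables);
      cleanup(job);
  }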
