Hi Danilo,

On 1/28/2026 7:09 PM, Danilo Krummrich wrote:
> On Wed Jan 28, 2026 at 4:27 PM CET, Joel Fernandes wrote:
>> I will go over these concerns, just to clarify - do you mean forbidding 
>> *any* lock or do you mean only forbidding non-atomic locks? I believe we 
>> can avoid non-atomic locks completely - actually I just wrote a patch 
>> before I read this email to do just that. If we are to forbid any locking 
>> at all, that might require some careful redesign to handle the above race 
>> afaics.
> 
> It's not about the locks themselves, sleeping locks are fine too.
Ah, so in your last email when you said "non-atomic", you meant an
allocation that can cause memory reclamation etc., right? I got confused by
"non-atomic" because I thought you were referring to acquiring a sleeping
lock in a non-atomic context (I also work on CPU scheduling/RCU, so the
word "atomic" sometimes means different things to me - my fault, not yours :P).

I believe we may have to use "try lock" on a mutex if we have to use these
in the future, in a path that cannot wait (such as a page fault handler).
But yes, I agree with you that we can use mutexes for these, with a
combination of try_lock + bottom half deferrals. See additional comment at [1].
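
A rough sketch of the pattern I have in mind (untested; the vmm_* names
are placeholders, not actual driver code):

#include <linux/mutex.h>
#include <linux/workqueue.h>

struct vmm_state {
	struct mutex lock;
	struct work_struct work;
};

/* Process context: sleeping on the mutex is fine here. */
static void vmm_deferred_work(struct work_struct *work)
{
	struct vmm_state *vmm = container_of(work, struct vmm_state, work);

	mutex_lock(&vmm->lock);
	/* ... redo the deferred page table update ... */
	mutex_unlock(&vmm->lock);
}

/* Path that cannot wait, e.g. a fault handler. */
static int vmm_fault_path(struct vmm_state *vmm)
{
	if (!mutex_trylock(&vmm->lock)) {
		/* Contended: punt to process context instead of sleeping. */
		schedule_work(&vmm->work);
		return -EBUSY;
	}
	/* ... install the mapping ... */
	mutex_unlock(&vmm->lock);
	return 0;
}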

Coming to the dma-fence deadlocks you mention, this sounds very similar to
my experience with reclaim deadlocks when I worked on the Ashmem Android
driver. Deja-vu :-D. The issue there was that the memory shrinker would
take a lock in the ashmem driver during reclaim, which is a disaster if
that lock was already held when a memory allocation triggered reclaim. I
believe the DMA fence use case is similar, based on your description.
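
For reference, the shape of that deadlock looked roughly like this
(sketch with made-up names, not the actual ashmem code):

#include <linux/mutex.h>
#include <linux/shrinker.h>
#include <linux/slab.h>

static DEFINE_MUTEX(drv_lock);

/* Shrinker callback, entered from direct reclaim. */
static unsigned long drv_shrink_scan(struct shrinker *s,
				     struct shrink_control *sc)
{
	mutex_lock(&drv_lock);	/* (2) blocks forever if (1) holds it */
	/* ... free cached objects ... */
	mutex_unlock(&drv_lock);
	return 0;
}

static void *drv_alloc_under_lock(void)
{
	void *p;

	mutex_lock(&drv_lock);	/* (1) driver lock held across allocation */
	/*
	 * GFP_KERNEL may enter direct reclaim, which can invoke
	 * drv_shrink_scan() above -> self-deadlock on drv_lock.
	 * mutex_trylock() in the shrinker (or GFP_NOWAIT here) avoids it.
	 */
	p = kmalloc(128, GFP_KERNEL);
	mutex_unlock(&drv_lock);
	return p;
}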

> It's about
> holding locks that are held elsewhere when doing memory allocations that can
> call back into MMU notifiers or the shrinker.
> 
> I.e. if in the fence signalling critical path you wait for a mutex that is
> held elsewhere while allocating memory and the memory allocation calls back
> into the shrinker, you may end up waiting for your own DMA fence to be
> signaled, which causes a deadlock.

Got it, I will spend the next day or so studying the DMA fence
architecture, but I mostly get the idea now. We need to be careful with
locking around reclaim, as you stressed. I will analyze all the
requirements to properly address this, and will reach out if I have any
questions. Thanks for sharing your knowledge on this!
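
For my own notes: lockdep already has annotations for exactly this, so a
sketch like the below (job_complete/my_job are made up) should let lockdep
catch such a deadlock, if I understand correctly:

#include <linux/dma-fence.h>

struct my_job {
	struct dma_fence *fence;
};

static void job_complete(struct my_job *job)
{
	/* Mark the fence signalling critical section for lockdep. */
	bool cookie = dma_fence_begin_signalling();

	/*
	 * Nothing in here may wait on a lock that is held elsewhere
	 * around a reclaim-capable allocation, since the shrinker could
	 * in turn be waiting on this very fence.
	 */
	dma_fence_signal(job->fence);

	dma_fence_end_signalling(cookie);
}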

--
Joel Fernandes

[1]
I can confirm, for completeness, that both Nouveau and OpenRM use mutexes
for PT/VMM-related locking. In interrupt contexts, OpenRM does a "try lock"
on its mutex, AFAICS. This is similar to how Linux kernel mm page fault
handling acquires mmap_sem (via try-locking).

The Linux kernel does have per-PT spinlocks to handle the "2 paths try to
install a PDE/PTE" race, but I don't think we need that at the moment for
our use cases, as we can keep it simple and rely on the VMM mutex. We can
perhaps add that in later if needed (or use finer-grained block-level
locking), but let me know if anyone disagrees with that.
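
To illustrate what I mean by relying on the VMM mutex for that race
(sketch, hypothetical names and sizes):

#include <linux/errno.h>
#include <linux/mutex.h>

#define VMM_PDE_COUNT 512	/* made up for the sketch */

static DEFINE_MUTEX(vmm_lock);
static void *vmm_pde[VMM_PDE_COUNT];

void *vmm_alloc_pt(void);	/* hypothetical page table allocator */

/* Two paths racing to install the same PDE are serialized by the single
 * VMM mutex rather than by per-page-table spinlocks. */
static int vmm_populate_pde(unsigned int idx)
{
	int ret = 0;

	mutex_lock(&vmm_lock);
	if (!vmm_pde[idx]) {	/* re-check under the lock */
		vmm_pde[idx] = vmm_alloc_pt();
		if (!vmm_pde[idx])
			ret = -ENOMEM;
	}
	mutex_unlock(&vmm_lock);
	return ret;
}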
