On 3/6/26 14:03, Danilo Krummrich wrote:
> On Fri Mar 6, 2026 at 1:36 PM CET, Philipp Stanner wrote:
>> On Fri, 2026-03-06 at 13:31 +0100, Christian König wrote:
>>> All fences must always signal because the HW operation must always complete
>>> or be terminated by a timeout.
>>>
>>> If a fence signals only because it runs out of scope then that means that
>>> you have a huge potential for data corruption and that is even worse than
>>> not signaling a fence.
>
> If that happens, it is a functional bug, the potential data corruption is only
> within a separate memory object, e.g. GEM etc., no? I.e. it may fault the GPU,
> but it does not fault the kernel.

That makes the assumption that DMA operations are protected by an MMU which
provides virtual memory.

But the VM functionality of modern GPUs is the exception and not the norm for
devices using DMA fences.

>>> In other words not signaling a fence can leave the system in a deadlock
>>> state, but signaling it incorrectly usually results in random data
>>> corruption.
>
> Well, not signaling it results in a potential deadlock of the whole kernel,
> whereas wrongly signaling it is "only" a functional bug.

No, that results in random memory corruption, which is easily an order of
magnitude worse than just a kernel deadlock.

We have seen such bugs numerous times with suspend and resume on laptops in
different subsystems, i.e. not only GPU. And I'm absolutely clearly rejecting
any attempt to signal DMA fences when an object runs out of scope because of
that experience.

>> It all stands and falls with the question whether a fence can drop by
>> accident in Rust, or if it will only ever drop when the hw-ring is
>> closed.
>>
>> What do you believe is the right thing to do when a driver unloads?
>
> The fence has to be signaled -- ideally after shutting down all queues, but it
> has to be signaled.

Yeah well, this shutting down of all queues (and making sure that no write
operation is pending in caches etc...) is what people usually don't get right.

What you can do is things like setting your timeout to zero, which immediately
causes the HW operation to be terminated and the device to be reset.

This will then use the same code path as the mandatory timeout functionality
for DMA operations, and that usually works reliably.

>> Ideally we could design it in a way that the driver closes its rings,
>> the pending fences drop and get signaled with ECANCELED.
>>
>> Your concern seems to be a driver by accident dropping a fence while the
>> hardware is still processing the associated job.
>
> I'm not concerned about the "driver drops fence by accident" case, as it is
> less problematic than the "driver forgets to signal the fence" case. One is a
> logic bug, whereas the other can deadlock the kernel, i.e. it is unsafe in
> terms of Rust.
>
> (Technically, there are subsequent problems to solve, as core::mem::forget()
> is safe and would cause the same problem. However, this is not new, it applies
> to lock guards in general. We can catch such things with klint though.)
>
> Ultimately, a DMA fence (that has been exposed to the outside world) is
> technically equivalent to a lock guard.

+1

Yes, exactly that.

Regards,
Christian.

>
>> (how's that dangerous, though? Shouldn't parties waiting for the fence
>> detect the error? ECANCELED ⇒ you must not access the associated
>> memory)
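
To make the lock-guard comparison quoted above concrete, here is a minimal userspace sketch in plain Rust. It is not the kernel's dma_fence API and not a proposed design: `ToyFence`, `FenceState` and the two helper functions are hypothetical names invented for illustration, and the value 125 for ECANCELED is the Linux errno number, assumed here. The sketch only shows that a Drop-based signal fires when the handle goes out of scope and is silently skipped by `core::mem::forget()`. It deliberately models no hardware, so it says nothing about whether a device is still writing to memory when the drop happens, which is the data corruption concern raised in this mail.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Linux errno value for ECANCELED (an assumption of this sketch).
const ECANCELED: i32 = 125;

/// What a waiter can observe. Hypothetical type, not a kernel structure.
#[derive(Clone, Copy, Debug, PartialEq)]
enum FenceState {
    Pending,
    Signaled(Result<(), i32>),
}

/// Hypothetical producer-side handle, playing the role of the "lock guard".
struct ToyFence {
    state: Rc<Cell<FenceState>>,
}

impl ToyFence {
    /// Returns the producer handle plus a shared view for the waiter.
    fn new() -> (ToyFence, Rc<Cell<FenceState>>) {
        let state = Rc::new(Cell::new(FenceState::Pending));
        (ToyFence { state: Rc::clone(&state) }, state)
    }

    /// Explicit signaling consumes the handle. Drop still runs afterwards,
    /// sees that the fence is no longer pending, and does nothing.
    fn signal(self, result: Result<(), i32>) {
        self.state.set(FenceState::Signaled(result));
    }
}

impl Drop for ToyFence {
    fn drop(&mut self) {
        // Going out of scope without an explicit signal reports ECANCELED.
        if self.state.get() == FenceState::Pending {
            self.state.set(FenceState::Signaled(Err(ECANCELED)));
        }
    }
}

/// The handle goes out of scope, so the waiter sees the error.
fn dropped_fence_state() -> FenceState {
    let (fence, waiter) = ToyFence::new();
    drop(fence);
    waiter.get()
}

/// The handle is leaked with the safe mem::forget(), so Drop never runs
/// and the waiter would wait forever.
fn forgotten_fence_state() -> FenceState {
    let (fence, waiter) = ToyFence::new();
    std::mem::forget(fence);
    waiter.get()
}
```

Calling `dropped_fence_state()` yields a fence signaled with the ECANCELED error, while `forgotten_fence_state()` leaves it pending, which is the same failure mode as leaking a lock guard.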
