On Wed, Feb 21, 2024 at 9:44 PM Samuel Thibault <samuel.thiba...@gnu.org> wrote:
> Damien Zammit, on Wed, Feb 21, 2024 at 13:45:59 +0000, wrote:
> > Authored-by: Sergey Bugaev <buga...@gmail.com>
> IIRC you got to see which kind of deadlock we are protecting from
> here? It'd be very useful to document it here, so the next people
> reading this get to know.

I didn't document the exact deadlock in the code, since the deadlock is
only a manifestation of the actual issue: the entry was being modified
by other threads while it was being accessed inside vm_fault_wire().
That is, for one thing, technically undefined behavior, but it will
also _of course_ break any kind of logic that works with the entry's
offset / address range. It could be worse than a deadlock.

So what I wanted to emphasize in the code is that we should make sure
the entry is not concurrently read and modified, and I was going to
describe the actual deadlock we've witnessed in the commit message. So
let's go with that: I'll re-send this patch myself with a detailed
commit message and a reworked comment.

> > [...] We trust that the
> >  * kernel threads are well-behaved, and therefore will
> >  * not do anything destructive to this region of the map
> >  * while we have it unlocked.
>
> Are we not here precisely in the case that they are not well-behaved?

No. The threads are well-behaved, but it turns out that's not quite
enough: as the new comment says, in the face of clipping and coalescing
of entries, even well-behaved, well-intentioned operations on
*adjacent* VM regions can affect the same entry. In fact, the entry can
get freed altogether if it gets coalesced into the previous entry,
which would create a use-after-free.

> The comment probably deserves an update, to make the situation
> clearer.

I thought it was clear enough, but evidently it wasn't. I'll amend it.

> > +	 * Once we unlock the map, even well-intentioned operations
> > +	 * on adjacent VM regions can end up affecting our entry,
> > +	 * due to clipping and coalescing entries. So, make a
> > +	 * temporary copy of the entry, and pass that to vm_fault_wire()
> > +	 * instead of the original.
>
> We need to be very careful with such a thing. If it's a copy that is
> given to further function calls, we need to make sure that they won't
> try to modify it, i.e. add const qualifiers on the entry parameter all
> the way down to vm_fault_wire and the callees it passes the entry to.

That's a good idea, yes.

> That being said, this doesn't seem completely safe to me: in the end
> vm_fault_wire_fast uses the object field; are we sure that the object
> won't disappear? Or are we sure it won't because "kernel threads are
> well-behaved"? (But I don't really know how we are sure of this, so
> this probably also deserves documenting.)

We are sure because kernel threads (or rather, any threads running in
the kernel) are well-behaved, yes. We trust that they won't unmap the
mapping while it's being wired, and the mapping (the entry) keeps the
object alive. But... now that I think of it, we do also coalesce
objects, so we had better keep a second reference on the object over
vm_fault_wire indeed.

> > +	 *
> >	 * HACK HACK HACK HACK
> >	 */
> >	if (vm_map_pmap(map) == kernel_pmap) {
> > +		/*
> > +		 * TODO: Support wiring more than one entry
> > +		 * in the kernel map at a time.
> > +		 */
> > +		assert(start->vme_next == end);
>
> Are we sure that this currently is never to happen?

Not really. I don't expect the kernel to do this anywhere, but I don't
know for sure. Please try running some complex workloads with this
patch and see whether you can catch the assertion failing. There isn't
anything fundamentally incompatible with supporting multi-entry wiring
here; I just didn't want to do an alloca.
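Coming back to the object lifetime point above: keeping that second
reference could look roughly like the sketch below. The names are the
real GNU Mach ones (entry->object.vm_object, vm_object_reference(),
vm_object_deallocate(), VM_OBJECT_NULL), but treat this as an
illustration of the idea, not the patch I'll actually send:

    struct vm_map_entry  entry_copy;
    vm_object_t          object;

    entry_copy = *start;                /* private snapshot of the entry */
    object = entry_copy.object.vm_object;
    if (object != VM_OBJECT_NULL)
        vm_object_reference(object);    /* pin it while the map is unlocked */

    vm_map_unlock(map);
    vm_fault_wire(map, &entry_copy);    /* only ever touches the copy */
    vm_map_lock(map);

    if (object != VM_OBJECT_NULL)
        vm_object_deallocate(object);   /* drop the extra reference */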
> > +	entry_copy = *start;
> >	vm_map_unlock(map);	/* trust me ... */
> > -	} else {
> > -		vm_map_lock_set_recursive(map);
> > -		vm_map_lock_write_to_read(map);
> > +	vm_fault_wire(map, &entry_copy);
>
> This is also assuming that for non-kernel_pmap there is just one entry
> for which to call vm_fault_wire, while userland can very well ask for
> wiring a wide range of memory, spanning various objects and whatnot.

No, this is still the kernel-map branch (note that the "else" is being
deleted by the patch).

> AIUI you'd rather want the converse: inline the code for the
> kernel_pmap case, which is much more trivial since start->vme_next ==
> end, and thus there is no for loop to perform: just call vm_fault_wire
> with the locking stuff around it, and be done.

That's exactly what this patch does, yes.
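For readers following along, the kernel-map branch after the patch ends
up looking roughly like this (reassembled from the hunks quoted above,
with the local spelled out; a sketch, not a verbatim quote of the tree):

    struct vm_map_entry  entry_copy;    /* lives on the stack */

    if (vm_map_pmap(map) == kernel_pmap) {
        /*
         * TODO: Support wiring more than one entry
         * in the kernel map at a time.
         */
        assert(start->vme_next == end);

        /*
         * Wire a private copy of the entry, so that clipping
         * and coalescing of the real entry cannot affect us
         * while the map is unlocked.
         */
        entry_copy = *start;
        vm_map_unlock(map);             /* trust me ... */
        vm_fault_wire(map, &entry_copy);
        vm_map_lock(map);
    }
    /* The user-map path, with its read lock and its loop over
       all the entries in the range, follows separately. */

Sergey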