On Fri, Jan 17, 2014 at 07:45:15PM -0800, Roland McGrath wrote:
> > This is why I was insisting on passing *memory* through IPC.
>
> It's not at all clear that makes any kind of sense, unless you mean
> something I haven't imagined. Can you be specific about exactly what the
> interface (say, a well-commented MiG .defs fragment) you have in mind
> would look like?
>
> If it's an RPC that passes out of line memory, that (IIRC) always has
> virtual-copy semantics, never page-sharing semantics. So it would be
> fundamentally the wrong model for matching up with other futex calls
> (from the same task or others) to synchronize on a shared int, which is
> what the futex semantics is all about.
That's right, IPC can only copy private memory, not make it shared. So
technically, the sharing would happen not through IPC, but through the VM
system.

> What I always anticipated for a Machish futex interface was vm_futex_*
> calls, which is to say, technically RPCs to the task port (which need
> not be the task port of the caller), passing an address as an integer
> literal just as calls like vm_write do (and each compare&exchange value
> as a datum, i.e. an integer literal, just as vm_write takes a datum of
> byte-array type, with semantics unchanged by whether that's inline or
> out of line memory).

That's more what I had in mind too.

> The task port and address serve as a proxy by which the kernel finds
> the memory object and offset, and the actual synchronization semantics
> are about that offset in that memory object and the contents of the
> word at that location. (Like all such calls, they would likely be
> optimized especially for the case of calls to task-self and probably
> even to the extent of having a bespoke syscall for the most-optimized
> case, as with vm_allocate. But that's later optimization.)
>
> Given the specified usage patterns for the futex operations, it might
> be reasonable enough to implement those semantics solely by translating
> to a physical page, including blocking to fault one in, and then
> associating the wait queues with offsets into the physical page rather
> than the memory object abstraction. (Both a waiter and a waker will
> have just faulted in the page before making the futex call anyway.)
> But note that the semantics require that if a waiter was blocked when
> the virtual page got paged out, then when you page it back in inside
> vm_futex_wake, that old waiter must get woken. I don't know the
> kernel's VM internals much at all, but I suspect that all tasks mapping
> a shared page do not get eagerly updated when the memory object page is
> paged in to service a page fault in some other task, but rather service
> minor faults on demand (i.e. later) to rediscover the new association
> between the virtual page and the new physical page incidentally brought
> in by someone else's page fault a little earlier. Since you need to
> track waiters at the memory object level while their page is
> nonresident anyway, it probably makes sense just to hang the
> {offset => wait queue} table off the memory object and always use that.
> At least, that seems like the approach for the first version that
> ensures correctness in all the corners of the semantics. It can get
> fancier as needed in later optimizations. When it comes to optimizing
> it, a fairly deep understanding of the Linux futex implementation
> (which I don't have off hand, though I have read it in the past) is
> probably instructive.

Locking physical pages could be used for denial of service: a user could
implicitly starve the system of wired memory unless those pages are
accounted as such, but then users might be "randomly" unable to use
mutexes. Coping with object/offset-to-page associations would imply a
container very similar to what has been considered until now for the
regular case (that is, a hash table or tree for all shared futexes), so
that futexes can immediately be reassociated with pages after faulting
them in. So I expect we have to use VM objects.

What I had in mind is already partially explained in previous mails, but
I still haven't taken the time to get a clear view of every use case, so
it's probably incomplete. It would start with a union of either
(task translated to map, address) or (object, offset), depending on the
futex type (private or shared, respectively).

Problems I can see with that approach are:

- Do we have to check that shared futexes refer to shareable memory?
- If so, how can that check be made reliably?
- What happens when unmapping a futex?
- Does copy-on-write have any effect on a private futex? If implemented
  as a (map, address) pair, I imagine it wouldn't, but is that true?
These are the kind of things I was hoping to discuss with this patch.

-- 
Richard Braun