Cc:ing tech-kern, to get wider feedback. Thread started here: http://mail-index.netbsd.org/source-changes-d/2011/08/21/msg003897.html
>>>>> "JM" == Jean-Yves Migeon <jeanyves.mig...@free.fr> writes: JM> On Mon, 22 Aug 2011 12:47:40 +0200, Manuel Bouyer wrote: >>> This is slightly more complicated than it appears. Some of the >>> "ops" in a per-cpu queue may have ordering dependencies with >>> other cpu queues, and I think this would be hard to express >>> trivially. (an example would be a pte update on one queue, and >>> reading the same pte read on another queue - these cases are >>> quite analogous (although completely unrelated) >> Hi, So I had a better look at this - implemented per-cpu queues and messed with locking a bit: >> read don't go through the xpq queue, don't they ? JM> Nope, PTE are directly obtained from the recursive mappings JM> (vtopte/kvtopte). Let's call this "out of band" reads. But see below for "in-band" reads. JM> Content is "obviously" only writable by hypervisor (so it can JM> keep control of his mapping alone). >> I think this is similar to a tlb flush but the other way round, I >> guess we could use a IPI for this too. JM> IIRC that's what the current native x86 code does: it uses an JM> IPI to signal other processors that a shootdown is necessary. Xen's TLB_FLUSH operation is synchronous, and doesn't require an IPI (within the domain), which makes the queue ordering even more important (to make sure that stale ptes are not reloaded before the per-cpu queue has made progress). Yes, we can implement a roundabout ipi driven queueflush + tlbflush scheme(described below), but that would be performance sensitive, and the basic issue won't go away, imho. Let's stick to the xpq ops for a second, ignoring "out-of-band" reads (for which I agree that your assertion, that locking needs to be done at a higher level, holds true). The question here, really is, what are the global ordering requirements of per-cpu memory op queues, given the following basic "ops": i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE) ii) read memory via: MMUEXT_PIN_L1_TABLE MMUEXT_PIN_L2_TABLE MMUEXT_PIN_L3_TABLE MMUEXT_PIN_L4_TABLE MMUEXT_UNPIN_TABLE MMUEXT_NEW_BASEPTR MMUEXT_TLB_FLUSH_LOCAL MMUEXT_INVLPG_LOCAL MMUEXT_TLB_FLUSH_MULTI MMUEXT_INVLPG_MULTI MMUEXT_TLB_FLUSH_ALL MMUEXT_INVLPG_ALL MMUEXT_FLUSH_CACHE MMUEXT_NEW_USER_BASEPTR (ie; anything that will cause the processor to re-read data updated via another cpu (via, for eg: pte update with i), above) There's two ways I can think of fixing this: a) *before* queueing a local read-op, synchronously flush queues on all other CPUs via ipis. This is slightly racy, but can be done, I think. An optimisation for invlpg could be to implement a scoreboard that watches mem. locations that have been queued for update on any cpu. Scan through the scoreboard for the memory range we're invlpg-ing. If it's not there, there's no need to flush any queues on other cpus. b) read-ops wait on a global condvar. If it's set, a write-op that needs flushing is pending. Wait (with optional timeout and ipi-"nudge") until the remote queue is flushed. When flushing a queue, send a cv_broadcast to any waiters. Option "b)" is slightly better than my current scheme which is to lock any and all mmu-ops and operate the queues synchronously (via XENDEBUG_SYNC). I cannot think of anything else, other than ad-hoc locking + queue flushing, which could be hard to maintain and debug in the long run. 
 >>> I'm thinking that it might be easier and more justifiable to
 >>> nuke the current queue scheme and implement shadow page tables,
 >>> which would fit more naturally and efficiently with CAS pte
 >>> updates, etc.

 >> I'm not sure this would completely fix the issue: with shadow
 >> page tables you can't use a CAS to assure atomic operation with
 >> the hardware TLB, as this is, precisely, a shadow PT and not the
 >> one used by hardware.

Definitely worth looking into, imho. I'm not very comfortable with
the queue-based scheme for MP.

The CAS doesn't provide any guarantees with respect to the TLB on
native h/w either, afaict. If you do a CAS pte update and the update
succeeds, it's a good idea to invalidate + shoot down anyway (even on
baremetal).

Do let me know your thoughts.

Cheers,
-- Cherry
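PS: to illustrate the last point, this is roughly the ordering I mean
by "CAS + invalidate + shootdown" on baremetal. Untested sketch; I'm
assuming pt_entry_t fits in a u_long (i.e. amd64 - PAE would want the
64-bit cas), and pmap_tlb_shootdown()'s arguments are from memory
rather than the exact interface:

#include <sys/types.h>
#include <sys/atomic.h>
#include <machine/pmap.h>

/*
 * Update a pte with CAS, then invalidate locally and shoot down the
 * other cpus even though the CAS succeeded - the old translation may
 * still be sitting in a TLB somewhere.
 */
static bool
pte_cas_update(struct pmap *pmap, vaddr_t va, pt_entry_t *ptep,
    pt_entry_t opte, pt_entry_t npte)
{
	if (atomic_cas_ulong((volatile u_long *)ptep, opte, npte) != opte)
		return false;	/* lost the race; caller re-reads and retries */

	pmap_update_pg(va);			/* local invlpg */
	pmap_tlb_shootdown(pmap, va, opte);	/* IPI other cpus (args from memory) */
	return true;
}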