On Mon, Aug 29, 2011 at 03:03:37PM +0200, Cherry G. Mathew wrote:
> Hi Manuel,
> 
> >>>>> "Manuel" == Manuel Bouyer <bou...@antioche.eu.org> writes:
> 
> > [...]
> 
> >> Xen's TLB_FLUSH operation is synchronous, and doesn't require an
> >> IPI (within the domain), which makes the queue ordering even more
> >> important (to make sure that stale ptes are not reloaded before
> >> the per-cpu queue has made progress). Yes, we can implement a
> >> roundabout IPI-driven queueflush + tlbflush scheme (described
> >> below), but that would be performance sensitive, and the basic
> >> issue won't go away, imho.
> >> 
> >> Let's stick to the xpq ops for a second, ignoring "out-of-band"
> >> reads (for which I agree that your assertion, that locking needs
> >> to be done at a higher level, holds true).
> >> 
> >> The question here, really, is: what are the global ordering
> >> requirements of per-cpu memory op queues, given the following
> >> basic "ops":
> >> 
> >> i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE)
> >> ii) read memory via: MMUEXT_PIN_L1_TABLE MMUEXT_PIN_L2_TABLE
> >>     MMUEXT_PIN_L3_TABLE MMUEXT_PIN_L4_TABLE MMUEXT_UNPIN_TABLE
> 
> Manuel> This is when adding/removing a page table from a pmap. When
> Manuel> this occurs, the pmap is locked, isn't it?
> 
> >>     MMUEXT_NEW_BASEPTR MMUEXT_NEW_USER_BASEPTR
> 
> Manuel> This is a context switch.
> 
> >>     MMUEXT_TLB_FLUSH_LOCAL MMUEXT_INVLPG_LOCAL MMUEXT_TLB_FLUSH_MULTI
> >>     MMUEXT_INVLPG_MULTI MMUEXT_TLB_FLUSH_ALL MMUEXT_INVLPG_ALL
> >>     MMUEXT_FLUSH_CACHE
> 
> Manuel> This may, or may not, cause a read. This usually happens
> Manuel> after updating the pmap, and I guess this also happens with
> Manuel> the pmap locked (I have not carefully checked).
> 
> Manuel> So couldn't we just use the pmap lock for this? I suspect
> Manuel> the same lock will also be needed for out-of-band reads at
> Manuel> some point (right now it's protected by splvm()).
> I'm a bit confused now - are we assuming that the pmap lock protects
> the (pte update op queue-push(es) + pmap_pte_flush()) as a single
> atomic operation (ie, no invlpg/tlbflush or out-of-band read can
> occur between the update(s) and the pmap_pte_flush())?
Out-of-band reads can always occur; there's no lock which can protect
against this.

> If so, I think I've slightly misunderstood the scope of the mmu queue
> design - I assumed that the queue is longer-lived, and the sync
> points (for the queue flush) can span across pmap locking - a sort of
> lazy pte update, with the queue being flushed at out-of-band or
> in-band read time (I guess that won't work though - how does one know
> when the hardware walks the page table?). It seems that the queue is
> meant for pte updates in loops, for eg., quickly followed by a flush.
> Is this correct?

It was not explicitly designed this way, but I think that's how things
are in practice, yes. Usage would need to be checked, though. There
may be some special case in the kernel pmap area ...

> If so, there's just one hazard afaict - the synchronous
> TLB_FLUSH_MULTI could beat the race between the queue update and the
> queue flush via pmap_tlb_shootnow() (see pmap_tlb.c on the
> cherry-xenmp branch), and *if* other CPUs reload their TLBs before
> the flush, they'll have stale info.
> 
> So the important question (rmind@ ?) is: is pmap_tlb_shootnow()
> guaranteed to be always called with the pmap lock held?

I don't think so; but I also don't think that's the problem. There
shouldn't be more races with this than on native hardware.

> In real life, I removed the global xpq_queue_lock() and the pmap was
> falling apart. So a bit of debugging ahead.

Hum, on second thought, something more may be needed to protect the
queue. The pmap lock probably won't work for pmap_kernel ...

> >> [...]
> >> 
> >> >>> I'm thinking that it might be easier and more justifiable to
> >> >>> nuke the current queue scheme and implement shadow page
> >> >>> tables, which would fit more naturally and efficiently with
> >> >>> CAS pte updates, etc.
> >> >> I'm not sure this would completely fix the issue: with shadow
> >> >> page tables you can't use a CAS to assure atomic operation
> >> >> with the hardware TLB, as this is, precisely, a shadow PT and
> >> >> not the one used by hardware.
> >> 
> >> Definitely worth looking into, imho. I'm not very comfortable
> >> with the queue-based scheme for MP.
> >> 
> >> The CAS doesn't provide any guarantees with the TLB on native
> >> h/w, afaict.
> 
> Manuel> What makes you think so? I think the hw TLB also does CAS
> Manuel> to update referenced and dirty bits in the PTE, otherwise we
> Manuel> couldn't rely on these bits; this would be bad especially
> Manuel> for the dirty bit.
> 
> Yes, I missed that one (which is much of the point of the CAS in the
> first place!), you're right.
> 
> >> If you do a CAS pte update, and the update succeeded, it's a good
> >> idea to invalidate + shootdown anyway (even on baremetal).
> 
> Manuel> Yes, of course inval is needed after updating the PTE. But
> Manuel> using a true CAS is important to get the referenced and
> Manuel> dirty bits right.
> 
> I haven't looked into this in detail, but I thought that xen
> disconnects the shadow (presumably with correct update/dirty bits)
> and flushes the TLB when a write to the shadow occurs? In which case
> these bits will be in sync until the next h/w access? I'm
> speculating, I haven't looked at the xen code for this yet.

If xen is emulating atomic operations on the shadow then that's
probably the way to go.

-- 
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference