Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 02:09:41AM +0100, Andrea Arcangeli wrote: > On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: > > Sorry, I realise I still didn't get this through my head yet (and also > > have not seen your patch recently). So I don't know exactly what you > > are doing... > > The last version was posted here: > > http://marc.info/?l=kvm-devel=120321732521533=2 > > > But why does _anybody_ (why does Christoph's patches) need to invalidate > > when they are going to be more permissive? This should be done lazily by > > the driver, I would have thought. > > This can be done lazily by the driver yes. The place where I've an > invalidate_pages in mprotect however can also become less permissive. That's OK, because we have to flush tlbs there too. > It's simpler to invalidate always and it's not guaranteed the > secondary mmu page fault is capable of refreshing the spte across a > writeprotect fault. I think we just have to make sure that it _can_ do writeprotect faults. AFAIKS, that will be possible if the driver registers a .page_mkwrite handler (actually not quite -- page_mkwrite is fairly crap, so I have a patch to merge it together with .fault so we get address information as well). Anyway, I really think we should do it that way. > In the future this can be changed to > mprotect_pages though, so no page fault will happen in the secondary > mmu. Possibly, but hopefully not needed for performance. Let's wait and see. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 05:40:50PM -0600, Jack Steiner wrote: > On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: > > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > > > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > > > > anything when changing the pte to be _more_ permissive, and I don't > > > > > > Note that in my patch the invalidate_pages in mprotect can be > > > trivially switched to a mprotect_pages with proper params. This will > > > prevent page faults completely in the secondary MMU (there will only > > > be tlb misses after the tlb flush just like for the core linux pte), > > > and it'll allow all the secondary MMU pte blocks (512/1024 at time > > > with my PT lock design) to be updated to have proper permissions > > > matching the core linux pte. > > > > Sorry, I realise I still didn't get this through my head yet (and also > > have not seen your patch recently). So I don't know exactly what you > > are doing... > > > > But why does _anybody_ (why does Christoph's patches) need to invalidate > > when they are going to be more permissive? This should be done lazily by > > the driver, I would have thought. > > > Agree. Although for most real applications, the performance difference > is probably negligible. But importantly, doing it that way means you share test coverage with the CPU TLB flushing code, and you don't introduce a new concept to the VM. So, it _has_ to be lazy flushing, IMO (as there doesn't seem to be a good reason otherwise). mprotect shouldn't really be a special case, because it still has to flush the CPU tlbs as well when restricting access. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 05:40:50PM -0600, Jack Steiner wrote: On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: anything when changing the pte to be _more_ permissive, and I don't Note that in my patch the invalidate_pages in mprotect can be trivially switched to a mprotect_pages with proper params. This will prevent page faults completely in the secondary MMU (there will only be tlb misses after the tlb flush just like for the core linux pte), and it'll allow all the secondary MMU pte blocks (512/1024 at time with my PT lock design) to be updated to have proper permissions matching the core linux pte. Sorry, I realise I still didn't get this through my head yet (and also have not seen your patch recently). So I don't know exactly what you are doing... But why does _anybody_ (why does Christoph's patches) need to invalidate when they are going to be more permissive? This should be done lazily by the driver, I would have thought. Agree. Although for most real applications, the performance difference is probably negligible. But importantly, doing it that way means you share test coverage with the CPU TLB flushing code, and you don't introduce a new concept to the VM. So, it _has_ to be lazy flushing, IMO (as there doesn't seem to be a good reason otherwise). mprotect shouldn't really be a special case, because it still has to flush the CPU tlbs as well when restricting access. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 02:09:41AM +0100, Andrea Arcangeli wrote: On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: Sorry, I realise I still didn't get this through my head yet (and also have not seen your patch recently). So I don't know exactly what you are doing... The last version was posted here: http://marc.info/?l=kvm-develm=120321732521533w=2 But why does _anybody_ (why does Christoph's patches) need to invalidate when they are going to be more permissive? This should be done lazily by the driver, I would have thought. This can be done lazily by the driver yes. The place where I've an invalidate_pages in mprotect however can also become less permissive. That's OK, because we have to flush tlbs there too. It's simpler to invalidate always and it's not guaranteed the secondary mmu page fault is capable of refreshing the spte across a writeprotect fault. I think we just have to make sure that it _can_ do writeprotect faults. AFAIKS, that will be possible if the driver registers a .page_mkwrite handler (actually not quite -- page_mkwrite is fairly crap, so I have a patch to merge it together with .fault so we get address information as well). Anyway, I really think we should do it that way. In the future this can be changed to mprotect_pages though, so no page fault will happen in the secondary mmu. Possibly, but hopefully not needed for performance. Let's wait and see. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > > > anything when changing the pte to be _more_ permissive, and I don't > > > > Note that in my patch the invalidate_pages in mprotect can be > > trivially switched to a mprotect_pages with proper params. This will > > prevent page faults completely in the secondary MMU (there will only > > be tlb misses after the tlb flush just like for the core linux pte), > > and it'll allow all the secondary MMU pte blocks (512/1024 at time > > with my PT lock design) to be updated to have proper permissions > > matching the core linux pte. > > Sorry, I realise I still didn't get this through my head yet (and also > have not seen your patch recently). So I don't know exactly what you > are doing... > > But why does _anybody_ (why does Christoph's patches) need to invalidate > when they are going to be more permissive? This should be done lazily by > the driver, I would have thought. I don't believe it should, but it probably does right now. I do know the case where a write fault where there is no need for a COW does not call out on the PTE change. I see no reason the others should not handle this as well. Just off the top of my head, I can only think of the mprotect case needing to special case the more permissive state and I don't think that changes PTEs at all, merely updates the VMA. Thanks, Robin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 01:52:06AM +0100, Andrea Arcangeli wrote: > On Wed, Feb 20, 2008 at 12:04:27AM +0100, Nick Piggin wrote: > > OK (thanks to Robin as well). Now I understand why you are using it, > > but I don't understand why you don't defer new TLBs after the point > > where the linux pte changes. If you can do that, then you look and > > act much more like a TLB from the point of view of the Linux vm. > > Christoph was forced to put the invalidate_range callback _after_ > dropping the PT lock because xpmem has to wait I/O there. But > invalidate_range is called after freeing the VM reference on the pages > so then GRU needed a _range_begin too because GRU has to flush the tlb > before the VM reference on the page is released (xpmem and KVM pin the > pages mapped by the secondary mmu, GRU doesn't). So then > invalidate_range was renamed to invalidate_range_end. Currently, xpmem blocks faults for the range specified at the _begin callout, then shoots down remote TLBs and does the put_page for all the pages in the specified range. The _end callout merely removes the block. We do not do any wait for I/O. By the time we return from the _begin callout, all activity by the remotes is stopped, pages are dereferenced, and future faults are blocked until released by the _end callout. Thanks, Robin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: > Sorry, I realise I still didn't get this through my head yet (and also > have not seen your patch recently). So I don't know exactly what you > are doing... The last version was posted here: http://marc.info/?l=kvm-devel=120321732521533=2 > But why does _anybody_ (why does Christoph's patches) need to invalidate > when they are going to be more permissive? This should be done lazily by > the driver, I would have thought. This can be done lazily by the driver yes. The place where I've an invalidate_pages in mprotect however can also become less permissive. It's simpler to invalidate always and it's not guaranteed the secondary mmu page fault is capable of refreshing the spte across a writeprotect fault. In the future this can be changed to mprotect_pages though, so no page fault will happen in the secondary mmu. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 12:04:27AM +0100, Nick Piggin wrote: > On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote: > > > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > > > > understand the need for invalidate_begin/invalidate_end pairs at all. > > > > > > The need of the pairs is crystal clear to me: range_begin is needed > > > for GRU _but_only_if_ range_end is called after releasing the > > > reference that the VM holds on the page. _begin will flush the GRU tlb > > > and at the same time it will take a mutex that will block further GRU > > > tlb-miss-interrupts (no idea how they manange those nightmare locking, > > > I didn't even try to add more locking to KVM and I get away with the > > > fact KVM takes the pin on the page itself). > > > > As it turns out, no actual mutex is required. _begin_ simply increments a > > count of active range invalidates, _end_ decrements the count. New TLB > > dropins are deferred while range callouts are active. > > > > This would appear to be racy but the GRU has special hardware that > > simplifies locking. When the GRU sees a TLB invalidate, all outstanding > > misses & potentially inflight TLB dropins are marked by the GRU with a > > "kill" bit. When the dropin finally occurs, the dropin is ignored & the > > instruction is simply restarted. The instruction will fault again & the TLB > > dropin will be repeated. This is optimized for the case where invalidates > > are rare - true for users of the GRU. > > OK (thanks to Robin as well). Now I understand why you are using it, > but I don't understand why you don't defer new TLBs after the point > where the linux pte changes. If you can do that, then you look and > act much more like a TLB from the point of view of the Linux vm. Christoph was forced to put the invalidate_range callback _after_ dropping the PT lock because xpmem has to wait I/O there. But invalidate_range is called after freeing the VM reference on the pages so then GRU needed a _range_begin too because GRU has to flush the tlb before the VM reference on the page is released (xpmem and KVM pin the pages mapped by the secondary mmu, GRU doesn't). So then invalidate_range was renamed to invalidate_range_end. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 11:59:23PM +0100, Nick Piggin wrote: > That's why I don't understand the need for the pairs: it should be > done like this. Yes, except it can't be done like this for xpmem. > OK, I didn't see the invalidate_pages call... See the last patch I posted to Andrew, you've probably looked at the old patches, the old patches didn't work for GRU and didn't work for xpmem and they weren't optimized to cluster the invalidates for each 4k-large-pte. > I thought that could be used by a non-sleeping user (not intending > to try supporting sleeping users). If it is useless then it should > go away (BTW. I didn't see your recent patch, some of my confusion > I think stems from Christoph's novel way of merging and splitting > patches). I kept improving my patch in case the VM maintainers would consider xpmem requirements not workable from a linux-VM point of view, and they preferred to have something obviously safe, strightforward and non intrusive, despite it doesn't support the only sleeping user out there I know of (xpmem). My patch supports KVM and GRU (and any other not sleeping user). > > No idea why xpmem needs range_begin, I perfectly understand why GRU > > needs _begin with Chrisotph's patch (gru lacks the page pin) but I > > dunno why xpmem needs range_begin (xpmem has the page pin so I also > > think it could avoid using range_begin). Still to support GRU you need > > both to call invalidate_range in places that can sleep and you need > > the external rmap notifier. The moment you add xpmem into the equation > > your and my clean patches become Christoph's one... > > Sorry, I kind of didn't have time to follow the conversation so well > before; are there patches posted for gru and/or xpmem? There's some xpmem code posted but the posted one isn't using the mmu notifiers yet. GRU code may be available from Jack. I only know for sure their requirements in terms of mmu notifiers. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > > > anything when changing the pte to be _more_ permissive, and I don't > > > > Note that in my patch the invalidate_pages in mprotect can be > > trivially switched to a mprotect_pages with proper params. This will > > prevent page faults completely in the secondary MMU (there will only > > be tlb misses after the tlb flush just like for the core linux pte), > > and it'll allow all the secondary MMU pte blocks (512/1024 at time > > with my PT lock design) to be updated to have proper permissions > > matching the core linux pte. > > Sorry, I realise I still didn't get this through my head yet (and also > have not seen your patch recently). So I don't know exactly what you > are doing... > > But why does _anybody_ (why does Christoph's patches) need to invalidate > when they are going to be more permissive? This should be done lazily by > the driver, I would have thought. Agree. Although for most real applications, the performance difference is probably negligible. --- jack -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > > anything when changing the pte to be _more_ permissive, and I don't > > Note that in my patch the invalidate_pages in mprotect can be > trivially switched to a mprotect_pages with proper params. This will > prevent page faults completely in the secondary MMU (there will only > be tlb misses after the tlb flush just like for the core linux pte), > and it'll allow all the secondary MMU pte blocks (512/1024 at time > with my PT lock design) to be updated to have proper permissions > matching the core linux pte. Sorry, I realise I still didn't get this through my head yet (and also have not seen your patch recently). So I don't know exactly what you are doing... But why does _anybody_ (why does Christoph's patches) need to invalidate when they are going to be more permissive? This should be done lazily by the driver, I would have thought. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote: > > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > > > understand the need for invalidate_begin/invalidate_end pairs at all. > > > > The need of the pairs is crystal clear to me: range_begin is needed > > for GRU _but_only_if_ range_end is called after releasing the > > reference that the VM holds on the page. _begin will flush the GRU tlb > > and at the same time it will take a mutex that will block further GRU > > tlb-miss-interrupts (no idea how they manange those nightmare locking, > > I didn't even try to add more locking to KVM and I get away with the > > fact KVM takes the pin on the page itself). > > As it turns out, no actual mutex is required. _begin_ simply increments a > count of active range invalidates, _end_ decrements the count. New TLB > dropins are deferred while range callouts are active. > > This would appear to be racy but the GRU has special hardware that > simplifies locking. When the GRU sees a TLB invalidate, all outstanding > misses & potentially inflight TLB dropins are marked by the GRU with a > "kill" bit. When the dropin finally occurs, the dropin is ignored & the > instruction is simply restarted. The instruction will fault again & the TLB > dropin will be repeated. This is optimized for the case where invalidates > are rare - true for users of the GRU. OK (thanks to Robin as well). Now I understand why you are using it, but I don't understand why you don't defer new TLBs after the point where the linux pte changes. If you can do that, then you look and act much more like a TLB from the point of view of the Linux vm. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > > are rather similar. However I have tried to make a point of minimising the > > impact the the core mm/. I don't see why we need to invalidate or flush > > I also tried hard to minimise the impact of the core mm/, I also > argued with Christoph that cluttering mm/ wasn't a good idea for > things like age_page that could be a 1 liner change instead of a > multiple-liner change, without any loss of flexibility or readability. > > > anything when changing the pte to be _more_ permissive, and I don't > > Note that in my patch the invalidate_pages in mprotect can be > trivially switched to a mprotect_pages with proper params. This will > prevent page faults completely in the secondary MMU (there will only > be tlb misses after the tlb flush just like for the core linux pte), > and it'll allow all the secondary MMU pte blocks (512/1024 at time > with my PT lock design) to be updated to have proper permissions > matching the core linux pte. > > > understand the need for invalidate_begin/invalidate_end pairs at all. > > The need of the pairs is crystal clear to me: range_begin is needed > for GRU _but_only_if_ range_end is called after releasing the > reference that the VM holds on the page. _begin will flush the GRU tlb > and at the same time it will take a mutex that will block further GRU > tlb-miss-interrupts (no idea how they manange those nightmare locking, > I didn't even try to add more locking to KVM and I get away with the > fact KVM takes the pin on the page itself). > > My patch calls invalidate_page/pages before the reference is released > on the page, so GRU will work fine despite lack of > range_begin. Furthermore with my patch GRU will be auto-serialized by > the PT lock w/o the need of any additional locking. That's why I don't understand the need for the pairs: it should be done like this. > > What I have done is basically create it so that the notifiers get called > > basically in the same place as the normal TLB flushing is done, and nowhere > > else. > > That was one of my objectives too. > > > I also wanted to avoid calling notifier code from inside eg. hardware TLB > > or pte manipulation primitives. These things are already pretty well > > spaghetti, so I'd like to just place them right where needed first... I > > think eventually it will need a bit of a rethink to make it more consistent > > and more general. But I prefer to do put them in the caller for the moment. > > Your patch should also work for KVM but it's suboptimal, my patch can > be orders of magnitude more efficient for GRU thanks to the > invalidate_pages optimization. Christoph complained about having to > call one method per pte. OK, I didn't see the invalidate_pages call... > And adding invalidate_range is useless unless you fully support > xpmem. You're calling invalidate_range in places that can't sleep... I thought that could be used by a non-sleeping user (not intending to try supporting sleeping users). If it is useless then it should go away (BTW. I didn't see your recent patch, some of my confusion I think stems from Christoph's novel way of merging and splitting patches). > No idea why xpmem needs range_begin, I perfectly understand why GRU > needs _begin with Chrisotph's patch (gru lacks the page pin) but I > dunno why xpmem needs range_begin (xpmem has the page pin so I also > think it could avoid using range_begin). Still to support GRU you need > both to call invalidate_range in places that can sleep and you need > the external rmap notifier. The moment you add xpmem into the equation > your and my clean patches become Christoph's one... Sorry, I kind of didn't have time to follow the conversation so well before; are there patches posted for gru and/or xpmem? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
> On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > > understand the need for invalidate_begin/invalidate_end pairs at all. > > The need of the pairs is crystal clear to me: range_begin is needed > for GRU _but_only_if_ range_end is called after releasing the > reference that the VM holds on the page. _begin will flush the GRU tlb > and at the same time it will take a mutex that will block further GRU > tlb-miss-interrupts (no idea how they manange those nightmare locking, > I didn't even try to add more locking to KVM and I get away with the > fact KVM takes the pin on the page itself). As it turns out, no actual mutex is required. _begin_ simply increments a count of active range invalidates, _end_ decrements the count. New TLB dropins are deferred while range callouts are active. This would appear to be racy but the GRU has special hardware that simplifies locking. When the GRU sees a TLB invalidate, all outstanding misses & potentially inflight TLB dropins are marked by the GRU with a "kill" bit. When the dropin finally occurs, the dropin is ignored & the instruction is simply restarted. The instruction will fault again & the TLB dropin will be repeated. This is optimized for the case where invalidates are rare - true for users of the GRU. In general, though, I agree. Most users of mmu_notifiers would likely required a mutex or something equivalent. --- jack -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > are rather similar. However I have tried to make a point of minimising the > impact the the core mm/. I don't see why we need to invalidate or flush I also tried hard to minimise the impact of the core mm/, I also argued with Christoph that cluttering mm/ wasn't a good idea for things like age_page that could be a 1 liner change instead of a multiple-liner change, without any loss of flexibility or readability. > anything when changing the pte to be _more_ permissive, and I don't Note that in my patch the invalidate_pages in mprotect can be trivially switched to a mprotect_pages with proper params. This will prevent page faults completely in the secondary MMU (there will only be tlb misses after the tlb flush just like for the core linux pte), and it'll allow all the secondary MMU pte blocks (512/1024 at time with my PT lock design) to be updated to have proper permissions matching the core linux pte. > understand the need for invalidate_begin/invalidate_end pairs at all. The need of the pairs is crystal clear to me: range_begin is needed for GRU _but_only_if_ range_end is called after releasing the reference that the VM holds on the page. _begin will flush the GRU tlb and at the same time it will take a mutex that will block further GRU tlb-miss-interrupts (no idea how they manange those nightmare locking, I didn't even try to add more locking to KVM and I get away with the fact KVM takes the pin on the page itself). My patch calls invalidate_page/pages before the reference is released on the page, so GRU will work fine despite lack of range_begin. Furthermore with my patch GRU will be auto-serialized by the PT lock w/o the need of any additional locking. > What I have done is basically create it so that the notifiers get called > basically in the same place as the normal TLB flushing is done, and nowhere > else. That was one of my objectives too. > I also wanted to avoid calling notifier code from inside eg. hardware TLB > or pte manipulation primitives. These things are already pretty well > spaghetti, so I'd like to just place them right where needed first... I > think eventually it will need a bit of a rethink to make it more consistent > and more general. But I prefer to do put them in the caller for the moment. Your patch should also work for KVM but it's suboptimal, my patch can be orders of magnitude more efficient for GRU thanks to the invalidate_pages optimization. Christoph complained about having to call one method per pte. And adding invalidate_range is useless unless you fully support xpmem. You're calling invalidate_range in places that can't sleep... No idea why xpmem needs range_begin, I perfectly understand why GRU needs _begin with Chrisotph's patch (gru lacks the page pin) but I dunno why xpmem needs range_begin (xpmem has the page pin so I also think it could avoid using range_begin). Still to support GRU you need both to call invalidate_range in places that can sleep and you need the external rmap notifier. The moment you add xpmem into the equation your and my clean patches become Christoph's one... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > So I implemented mmu notifiers slightly differently. Andrea's mmu notifiers > are rather similar. However I have tried to make a point of minimising the > impact the the core mm/. I don't see why we need to invalidate or flush > anything when changing the pte to be _more_ permissive, and I don't > understand the need for invalidate_begin/invalidate_end pairs at all. > What I have done is basically create it so that the notifiers get called > basically in the same place as the normal TLB flushing is done, and nowhere > else. Because XPMEM needs to be able to sleep during its callout. For that, we need to move this outside of the page table lock and suddenly we need the begin/end pair again. There was considerable discussion about this exact point numerous times. We tried to develop the most inclusive design possible. Our design would even be extendable to IB, assuming they made some very disruptive changes to their MPI and communication libraries. IB would suffer the same problems XPMEM does in that the TLB entries need to be removed on a remote host which is operating completely independently. Thanks, Robin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: So I implemented mmu notifiers slightly differently. Andrea's mmu notifiers are rather similar. However I have tried to make a point of minimising the impact the the core mm/. I don't see why we need to invalidate or flush anything when changing the pte to be _more_ permissive, and I don't understand the need for invalidate_begin/invalidate_end pairs at all. What I have done is basically create it so that the notifiers get called basically in the same place as the normal TLB flushing is done, and nowhere else. Because XPMEM needs to be able to sleep during its callout. For that, we need to move this outside of the page table lock and suddenly we need the begin/end pair again. There was considerable discussion about this exact point numerous times. We tried to develop the most inclusive design possible. Our design would even be extendable to IB, assuming they made some very disruptive changes to their MPI and communication libraries. IB would suffer the same problems XPMEM does in that the TLB entries need to be removed on a remote host which is operating completely independently. Thanks, Robin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: are rather similar. However I have tried to make a point of minimising the impact the the core mm/. I don't see why we need to invalidate or flush I also tried hard to minimise the impact of the core mm/, I also argued with Christoph that cluttering mm/ wasn't a good idea for things like age_page that could be a 1 liner change instead of a multiple-liner change, without any loss of flexibility or readability. anything when changing the pte to be _more_ permissive, and I don't Note that in my patch the invalidate_pages in mprotect can be trivially switched to a mprotect_pages with proper params. This will prevent page faults completely in the secondary MMU (there will only be tlb misses after the tlb flush just like for the core linux pte), and it'll allow all the secondary MMU pte blocks (512/1024 at time with my PT lock design) to be updated to have proper permissions matching the core linux pte. understand the need for invalidate_begin/invalidate_end pairs at all. The need of the pairs is crystal clear to me: range_begin is needed for GRU _but_only_if_ range_end is called after releasing the reference that the VM holds on the page. _begin will flush the GRU tlb and at the same time it will take a mutex that will block further GRU tlb-miss-interrupts (no idea how they manange those nightmare locking, I didn't even try to add more locking to KVM and I get away with the fact KVM takes the pin on the page itself). My patch calls invalidate_page/pages before the reference is released on the page, so GRU will work fine despite lack of range_begin. Furthermore with my patch GRU will be auto-serialized by the PT lock w/o the need of any additional locking. What I have done is basically create it so that the notifiers get called basically in the same place as the normal TLB flushing is done, and nowhere else. That was one of my objectives too. I also wanted to avoid calling notifier code from inside eg. hardware TLB or pte manipulation primitives. These things are already pretty well spaghetti, so I'd like to just place them right where needed first... I think eventually it will need a bit of a rethink to make it more consistent and more general. But I prefer to do put them in the caller for the moment. Your patch should also work for KVM but it's suboptimal, my patch can be orders of magnitude more efficient for GRU thanks to the invalidate_pages optimization. Christoph complained about having to call one method per pte. And adding invalidate_range is useless unless you fully support xpmem. You're calling invalidate_range in places that can't sleep... No idea why xpmem needs range_begin, I perfectly understand why GRU needs _begin with Chrisotph's patch (gru lacks the page pin) but I dunno why xpmem needs range_begin (xpmem has the page pin so I also think it could avoid using range_begin). Still to support GRU you need both to call invalidate_range in places that can sleep and you need the external rmap notifier. The moment you add xpmem into the equation your and my clean patches become Christoph's one... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: understand the need for invalidate_begin/invalidate_end pairs at all. The need of the pairs is crystal clear to me: range_begin is needed for GRU _but_only_if_ range_end is called after releasing the reference that the VM holds on the page. _begin will flush the GRU tlb and at the same time it will take a mutex that will block further GRU tlb-miss-interrupts (no idea how they manange those nightmare locking, I didn't even try to add more locking to KVM and I get away with the fact KVM takes the pin on the page itself). As it turns out, no actual mutex is required. _begin_ simply increments a count of active range invalidates, _end_ decrements the count. New TLB dropins are deferred while range callouts are active. This would appear to be racy but the GRU has special hardware that simplifies locking. When the GRU sees a TLB invalidate, all outstanding misses potentially inflight TLB dropins are marked by the GRU with a kill bit. When the dropin finally occurs, the dropin is ignored the instruction is simply restarted. The instruction will fault again the TLB dropin will be repeated. This is optimized for the case where invalidates are rare - true for users of the GRU. In general, though, I agree. Most users of mmu_notifiers would likely required a mutex or something equivalent. --- jack -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: are rather similar. However I have tried to make a point of minimising the impact the the core mm/. I don't see why we need to invalidate or flush I also tried hard to minimise the impact of the core mm/, I also argued with Christoph that cluttering mm/ wasn't a good idea for things like age_page that could be a 1 liner change instead of a multiple-liner change, without any loss of flexibility or readability. anything when changing the pte to be _more_ permissive, and I don't Note that in my patch the invalidate_pages in mprotect can be trivially switched to a mprotect_pages with proper params. This will prevent page faults completely in the secondary MMU (there will only be tlb misses after the tlb flush just like for the core linux pte), and it'll allow all the secondary MMU pte blocks (512/1024 at time with my PT lock design) to be updated to have proper permissions matching the core linux pte. understand the need for invalidate_begin/invalidate_end pairs at all. The need of the pairs is crystal clear to me: range_begin is needed for GRU _but_only_if_ range_end is called after releasing the reference that the VM holds on the page. _begin will flush the GRU tlb and at the same time it will take a mutex that will block further GRU tlb-miss-interrupts (no idea how they manange those nightmare locking, I didn't even try to add more locking to KVM and I get away with the fact KVM takes the pin on the page itself). My patch calls invalidate_page/pages before the reference is released on the page, so GRU will work fine despite lack of range_begin. Furthermore with my patch GRU will be auto-serialized by the PT lock w/o the need of any additional locking. That's why I don't understand the need for the pairs: it should be done like this. What I have done is basically create it so that the notifiers get called basically in the same place as the normal TLB flushing is done, and nowhere else. That was one of my objectives too. I also wanted to avoid calling notifier code from inside eg. hardware TLB or pte manipulation primitives. These things are already pretty well spaghetti, so I'd like to just place them right where needed first... I think eventually it will need a bit of a rethink to make it more consistent and more general. But I prefer to do put them in the caller for the moment. Your patch should also work for KVM but it's suboptimal, my patch can be orders of magnitude more efficient for GRU thanks to the invalidate_pages optimization. Christoph complained about having to call one method per pte. OK, I didn't see the invalidate_pages call... And adding invalidate_range is useless unless you fully support xpmem. You're calling invalidate_range in places that can't sleep... I thought that could be used by a non-sleeping user (not intending to try supporting sleeping users). If it is useless then it should go away (BTW. I didn't see your recent patch, some of my confusion I think stems from Christoph's novel way of merging and splitting patches). No idea why xpmem needs range_begin, I perfectly understand why GRU needs _begin with Chrisotph's patch (gru lacks the page pin) but I dunno why xpmem needs range_begin (xpmem has the page pin so I also think it could avoid using range_begin). Still to support GRU you need both to call invalidate_range in places that can sleep and you need the external rmap notifier. The moment you add xpmem into the equation your and my clean patches become Christoph's one... Sorry, I kind of didn't have time to follow the conversation so well before; are there patches posted for gru and/or xpmem? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote: On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: understand the need for invalidate_begin/invalidate_end pairs at all. The need of the pairs is crystal clear to me: range_begin is needed for GRU _but_only_if_ range_end is called after releasing the reference that the VM holds on the page. _begin will flush the GRU tlb and at the same time it will take a mutex that will block further GRU tlb-miss-interrupts (no idea how they manange those nightmare locking, I didn't even try to add more locking to KVM and I get away with the fact KVM takes the pin on the page itself). As it turns out, no actual mutex is required. _begin_ simply increments a count of active range invalidates, _end_ decrements the count. New TLB dropins are deferred while range callouts are active. This would appear to be racy but the GRU has special hardware that simplifies locking. When the GRU sees a TLB invalidate, all outstanding misses potentially inflight TLB dropins are marked by the GRU with a kill bit. When the dropin finally occurs, the dropin is ignored the instruction is simply restarted. The instruction will fault again the TLB dropin will be repeated. This is optimized for the case where invalidates are rare - true for users of the GRU. OK (thanks to Robin as well). Now I understand why you are using it, but I don't understand why you don't defer new TLBs after the point where the linux pte changes. If you can do that, then you look and act much more like a TLB from the point of view of the Linux vm. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: anything when changing the pte to be _more_ permissive, and I don't Note that in my patch the invalidate_pages in mprotect can be trivially switched to a mprotect_pages with proper params. This will prevent page faults completely in the secondary MMU (there will only be tlb misses after the tlb flush just like for the core linux pte), and it'll allow all the secondary MMU pte blocks (512/1024 at time with my PT lock design) to be updated to have proper permissions matching the core linux pte. Sorry, I realise I still didn't get this through my head yet (and also have not seen your patch recently). So I don't know exactly what you are doing... But why does _anybody_ (why does Christoph's patches) need to invalidate when they are going to be more permissive? This should be done lazily by the driver, I would have thought. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: anything when changing the pte to be _more_ permissive, and I don't Note that in my patch the invalidate_pages in mprotect can be trivially switched to a mprotect_pages with proper params. This will prevent page faults completely in the secondary MMU (there will only be tlb misses after the tlb flush just like for the core linux pte), and it'll allow all the secondary MMU pte blocks (512/1024 at time with my PT lock design) to be updated to have proper permissions matching the core linux pte. Sorry, I realise I still didn't get this through my head yet (and also have not seen your patch recently). So I don't know exactly what you are doing... But why does _anybody_ (why does Christoph's patches) need to invalidate when they are going to be more permissive? This should be done lazily by the driver, I would have thought. Agree. Although for most real applications, the performance difference is probably negligible. --- jack -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 11:59:23PM +0100, Nick Piggin wrote: That's why I don't understand the need for the pairs: it should be done like this. Yes, except it can't be done like this for xpmem. OK, I didn't see the invalidate_pages call... See the last patch I posted to Andrew, you've probably looked at the old patches, the old patches didn't work for GRU and didn't work for xpmem and they weren't optimized to cluster the invalidates for each 4k-large-pte. I thought that could be used by a non-sleeping user (not intending to try supporting sleeping users). If it is useless then it should go away (BTW. I didn't see your recent patch, some of my confusion I think stems from Christoph's novel way of merging and splitting patches). I kept improving my patch in case the VM maintainers would consider xpmem requirements not workable from a linux-VM point of view, and they preferred to have something obviously safe, strightforward and non intrusive, despite it doesn't support the only sleeping user out there I know of (xpmem). My patch supports KVM and GRU (and any other not sleeping user). No idea why xpmem needs range_begin, I perfectly understand why GRU needs _begin with Chrisotph's patch (gru lacks the page pin) but I dunno why xpmem needs range_begin (xpmem has the page pin so I also think it could avoid using range_begin). Still to support GRU you need both to call invalidate_range in places that can sleep and you need the external rmap notifier. The moment you add xpmem into the equation your and my clean patches become Christoph's one... Sorry, I kind of didn't have time to follow the conversation so well before; are there patches posted for gru and/or xpmem? There's some xpmem code posted but the posted one isn't using the mmu notifiers yet. GRU code may be available from Jack. I only know for sure their requirements in terms of mmu notifiers. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 12:04:27AM +0100, Nick Piggin wrote: On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote: On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: understand the need for invalidate_begin/invalidate_end pairs at all. The need of the pairs is crystal clear to me: range_begin is needed for GRU _but_only_if_ range_end is called after releasing the reference that the VM holds on the page. _begin will flush the GRU tlb and at the same time it will take a mutex that will block further GRU tlb-miss-interrupts (no idea how they manange those nightmare locking, I didn't even try to add more locking to KVM and I get away with the fact KVM takes the pin on the page itself). As it turns out, no actual mutex is required. _begin_ simply increments a count of active range invalidates, _end_ decrements the count. New TLB dropins are deferred while range callouts are active. This would appear to be racy but the GRU has special hardware that simplifies locking. When the GRU sees a TLB invalidate, all outstanding misses potentially inflight TLB dropins are marked by the GRU with a kill bit. When the dropin finally occurs, the dropin is ignored the instruction is simply restarted. The instruction will fault again the TLB dropin will be repeated. This is optimized for the case where invalidates are rare - true for users of the GRU. OK (thanks to Robin as well). Now I understand why you are using it, but I don't understand why you don't defer new TLBs after the point where the linux pte changes. If you can do that, then you look and act much more like a TLB from the point of view of the Linux vm. Christoph was forced to put the invalidate_range callback _after_ dropping the PT lock because xpmem has to wait I/O there. But invalidate_range is called after freeing the VM reference on the pages so then GRU needed a _range_begin too because GRU has to flush the tlb before the VM reference on the page is released (xpmem and KVM pin the pages mapped by the secondary mmu, GRU doesn't). So then invalidate_range was renamed to invalidate_range_end. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: Sorry, I realise I still didn't get this through my head yet (and also have not seen your patch recently). So I don't know exactly what you are doing... The last version was posted here: http://marc.info/?l=kvm-develm=120321732521533w=2 But why does _anybody_ (why does Christoph's patches) need to invalidate when they are going to be more permissive? This should be done lazily by the driver, I would have thought. This can be done lazily by the driver yes. The place where I've an invalidate_pages in mprotect however can also become less permissive. It's simpler to invalidate always and it's not guaranteed the secondary mmu page fault is capable of refreshing the spte across a writeprotect fault. In the future this can be changed to mprotect_pages though, so no page fault will happen in the secondary mmu. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 01:52:06AM +0100, Andrea Arcangeli wrote: On Wed, Feb 20, 2008 at 12:04:27AM +0100, Nick Piggin wrote: OK (thanks to Robin as well). Now I understand why you are using it, but I don't understand why you don't defer new TLBs after the point where the linux pte changes. If you can do that, then you look and act much more like a TLB from the point of view of the Linux vm. Christoph was forced to put the invalidate_range callback _after_ dropping the PT lock because xpmem has to wait I/O there. But invalidate_range is called after freeing the VM reference on the pages so then GRU needed a _range_begin too because GRU has to flush the tlb before the VM reference on the page is released (xpmem and KVM pin the pages mapped by the secondary mmu, GRU doesn't). So then invalidate_range was renamed to invalidate_range_end. Currently, xpmem blocks faults for the range specified at the _begin callout, then shoots down remote TLBs and does the put_page for all the pages in the specified range. The _end callout merely removes the block. We do not do any wait for I/O. By the time we return from the _begin callout, all activity by the remotes is stopped, pages are dereferenced, and future faults are blocked until released by the _end callout. Thanks, Robin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: anything when changing the pte to be _more_ permissive, and I don't Note that in my patch the invalidate_pages in mprotect can be trivially switched to a mprotect_pages with proper params. This will prevent page faults completely in the secondary MMU (there will only be tlb misses after the tlb flush just like for the core linux pte), and it'll allow all the secondary MMU pte blocks (512/1024 at time with my PT lock design) to be updated to have proper permissions matching the core linux pte. Sorry, I realise I still didn't get this through my head yet (and also have not seen your patch recently). So I don't know exactly what you are doing... But why does _anybody_ (why does Christoph's patches) need to invalidate when they are going to be more permissive? This should be done lazily by the driver, I would have thought. I don't believe it should, but it probably does right now. I do know the case where a write fault where there is no need for a COW does not call out on the PTE change. I see no reason the others should not handle this as well. Just off the top of my head, I can only think of the mprotect case needing to special case the more permissive state and I don't think that changes PTEs at all, merely updates the VMA. Thanks, Robin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/