Re: [PATCH] mmu notifiers #v2
On Thu, Jan 17, 2008 at 08:21:16PM +0200, Izik Eidus wrote:
> ohh i like it, this is a clever solution, and i guess the cost of the
> vmexits won't be too high if it is not too aggressive

Yes, and especially during swapping, the system isn't usually CPU
bound. The idea is to pay with some vmexit minor faults when the CPU
tends to be idle, to reduce the amount of swapouts. I say swapouts and
not swapins because it will mostly help avoid writing out swapcache to
disk for no good reason. Swapins already have a chance not to generate
any read-I/O if the removed spte is really hot.

To make this work we still need notification from the VM about memory
pressure, and perhaps the slab shrinker method is enough even if it
has a coarse granularity. Freeing sptes during memory pressure also
converges with the objective of releasing pinned slab memory, so that
the spte cache can grow more freely (the 4k PAGE_SIZE for 0-order page
defrag philosophy will also appreciate that). There are lots of
details to figure out in a good implementation, but the basic idea
converges on two fairly important fronts.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmu notifiers #v2
Andrea Arcangeli wrote:
> On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote:
>> Rik van Riel wrote:
>>> On Sun, 13 Jan 2008 17:24:18 +0100
>>> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>>>> In my basic initial patch I only track the tlb flushes which should be
>>>> the minimum required to have a nice linux-VM controlled swapping
>>>> behavior of the KVM gphysical memory.
>>> I have a vaguely related question on KVM swapping.
>>> Do page accesses inside KVM guests get propagated to the host
>>> OS, so Linux can choose a reasonable page for eviction, or is
>>> the pageout of KVM guest pages essentially random?
> Right, selection of the guest OS pages to swap is partly random but
> wait: _only_ for the long-cached and hot spte entries. It's certainly
> not entirely random. As the shadow-cache is a bit dynamic, every newly
> instantiated spte will refresh the PG_referenced bit in follow_page
> already (through minor faults). Not-present faults of swapped
> non-present sptes can trigger minor faults from swapcache too, and
> they'll refresh young regular ptes.
>> right now when kvm removes a pte from the shadow cache, it marks as
>> accessed the page that this pte pointed to.
> Yes: the referenced bit in the mmu-notifier invalidate case isn't
> useful because it's set right before freeing the page.
>> it was a good solution until the mmu notifiers, because the pages
>> were pinned and couldn't be swapped to disk
> It probably still makes sense for sptes removed because of other
> reasons (not mmu notifier invalidates).

agree

>> so now it will have to do something more sophisticated, or at least
>> mark as accessed every page pointed to by a pte that gets inserted
>> into the shadow cache
> I think that should already be the case, see the mark_page_accessed
> in follow_page: FOLL_TOUCH is set, isn't it?

yes you are right, FOLL_TOUCH is set.

> The only thing we clearly miss is a logic that refreshes the
> PG_referenced bitflag for "hot" sptes that remain instantiated and
> cached for a long time. For regular linux ptes this is done by the
> cpu through the young bitflag. But note that not all architectures
> have the young bitflag support in hardware! So I suppose the swapping
> of the KVM task is like the swapping of any other task, but on an
> alpha CPU. It works well enough in practice, even if we clearly have
> room for further optimizations in this area (as there would be on
> archs w/o the young bit updated in hardware too).
>
> To refresh the PG_referenced bit for long lived hot sptes, I think
> the easiest solution is to chain the sptes in a lru, and to start
> dropping them when memory pressure starts. We could drop one spte
> every X pages collected by the VM. So the "age" time factor depends
> on the VM velocity, and we totally avoid useless shadow page faults
> when there's no VM pressure. When VM pressure increases, the kvm
> non-present fault will then take care to refresh the PG_referenced
> bit. This should solve the aging issue for long lived and hot sptes.
> It should improve the responsiveness of the guest OS during "initial"
> swap pressure (after the initial swap pressure, the working set finds
> itself in ram again), so it should avoid some not-required
> swapout/swapin jitter during the initial swap. I see this mostly as a
> kvm internal optimization, not strictly related to the mmu notifiers
> though.

ohh i like it, this is a clever solution, and i guess the cost of the
vmexits won't be too high if it is not too aggressive
Re: [PATCH] mmu notifiers #v2
On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote:
> Rik van Riel wrote:
>> On Sun, 13 Jan 2008 17:24:18 +0100
>> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>>
>>> In my basic initial patch I only track the tlb flushes which should be
>>> the minimum required to have a nice linux-VM controlled swapping
>>> behavior of the KVM gphysical memory.
>>
>> I have a vaguely related question on KVM swapping.
>>
>> Do page accesses inside KVM guests get propagated to the host
>> OS, so Linux can choose a reasonable page for eviction, or is
>> the pageout of KVM guest pages essentially random?

Right, selection of the guest OS pages to swap is partly random but
wait: _only_ for the long-cached and hot spte entries. It's certainly
not entirely random. As the shadow-cache is a bit dynamic, every newly
instantiated spte will refresh the PG_referenced bit in follow_page
already (through minor faults). Not-present faults of swapped
non-present sptes can trigger minor faults from swapcache too, and
they'll refresh young regular ptes.

> right now when kvm removes a pte from the shadow cache, it marks as
> accessed the page that this pte pointed to.

Yes: the referenced bit in the mmu-notifier invalidate case isn't
useful because it's set right before freeing the page.

> it was a good solution until the mmu notifiers, because the pages
> were pinned and couldn't be swapped to disk

It probably still makes sense for sptes removed because of other
reasons (not mmu notifier invalidates).

> so now it will have to do something more sophisticated, or at least
> mark as accessed every page pointed to by a pte that gets inserted
> into the shadow cache

I think that should already be the case, see the mark_page_accessed in
follow_page: FOLL_TOUCH is set, isn't it?

The only thing we clearly miss is a logic that refreshes the
PG_referenced bitflag for "hot" sptes that remain instantiated and
cached for a long time. For regular linux ptes this is done by the cpu
through the young bitflag. But note that not all architectures have
the young bitflag support in hardware! So I suppose the swapping of
the KVM task is like the swapping of any other task, but on an alpha
CPU. It works well enough in practice, even if we clearly have room
for further optimizations in this area (as there would be on archs w/o
the young bit updated in hardware too).

To refresh the PG_referenced bit for long lived hot sptes, I think the
easiest solution is to chain the sptes in a lru, and to start dropping
them when memory pressure starts. We could drop one spte every X pages
collected by the VM. So the "age" time factor depends on the VM
velocity, and we totally avoid useless shadow page faults when there's
no VM pressure. When VM pressure increases, the kvm non-present fault
will then take care to refresh the PG_referenced bit. This should
solve the aging issue for long lived and hot sptes. It should improve
the responsiveness of the guest OS during "initial" swap pressure
(after the initial swap pressure, the working set finds itself in ram
again), so it should avoid some not-required swapout/swapin jitter
during the initial swap. I see this mostly as a kvm internal
optimization, not strictly related to the mmu notifiers though.
Re: [PATCH] mmu notifiers #v2
Rik van Riel wrote:
> On Sun, 13 Jan 2008 17:24:18 +0100
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>> In my basic initial patch I only track the tlb flushes which should be
>> the minimum required to have a nice linux-VM controlled swapping
>> behavior of the KVM gphysical memory.
>
> I have a vaguely related question on KVM swapping.
>
> Do page accesses inside KVM guests get propagated to the host
> OS, so Linux can choose a reasonable page for eviction, or is
> the pageout of KVM guest pages essentially random?

right now when kvm removes a pte from the shadow cache, it marks as
accessed the page that this pte pointed to.

it was a good solution until the mmu notifiers, because the pages were
pinned and couldn't be swapped to disk.

so now it will have to do something more sophisticated, or at least
mark as accessed every page pointed to by a pte that gets inserted
into the shadow cache.
Re: [PATCH] mmu notifiers #v2
On Sun, 13 Jan 2008 17:24:18 +0100
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

> In my basic initial patch I only track the tlb flushes which should be
> the minimum required to have a nice linux-VM controlled swapping
> behavior of the KVM gphysical memory.

I have a vaguely related question on KVM swapping.

Do page accesses inside KVM guests get propagated to the host
OS, so Linux can choose a reasonable page for eviction, or is
the pageout of KVM guest pages essentially random?

--
All rights reversed.
Re: [PATCH] mmu notifiers #v2
On Wed, Jan 16, 2008 at 10:01:32AM +0100, Brice Goglin wrote:
> One of the differences with my patch is that you attach the notifier list
> to the mm_struct while my code attached it to vmas. But I now don't think
> it was such a good idea since it probably didn't reduce the number of
> notifier calls a lot.

Thanks for raising this topic. Notably KVM also would be a bit more
optimal with the notifier in the vma, and that was the original
implementation too. It's not a sure thing that it has to be in the mm.

The quadrics patch does a mixture: it attaches it to the mm but then
it pretends to pass the vma down to the method, and it's broken doing
so, like during munmap where it passes the first vma being unmapped
but not all the later ones in the munmap range. If we want to attach
it to the vma, I think the vma should be passed as a parameter instead
of the mm. In some places like apply_to_page_range the vma isn't even
available, and I found it a little dirty to run a find_vma inside a
#ifdef CONFIG_MMU_NOTIFIER.

The only thing the vma could be interesting for are the protection
bits, for things like update_range in the quadrics patch where they
prefetch their secondary tlb. But again, if we want to do that, we
need to hook inside unmap_vmas and pass all the different vmas, not
just the first one touched by unmap_vmas. unmap_vmas is _plural_ not
singular ;).

In the end, attaching to the mm avoided solving all the above troubles
and provided a straightforward implementation where I would need a
single call to mmu_notifier_register, plus other minor advantages like
that and not much downside. But certainly the mm vs vma decision
wasn't trivial (I switched back and forth a few times from vma to mm
and back), and if people think this should be in the vma I can try
again, but it won't be as straightforward a patch as for the mm.

One benefit, for example, is that it could go in the memslot, and
effectively the notifier->memslot conversion would be just a
container_of instead of a "search" over the memslots. Locking aside.

> Also, one thing that I looked at in vmaspy was notifying fork. I am not
> sure what happens on Copy-on-write with your code, but for sure C-o-w is
> problematic for shadow page tables. I thought shadow pages should just be
> invalidated when a fork happens and the caller would refill them after
> forcing C-o-w or so. So adding a notifier call there too might be nice.

There can't be any cows right now in KVM VM backing store, that's why
it's enough to get full swapping working fine. For example I think
we'll need to add more notifiers to handle swapping of MAP_PRIVATE
non linear tmpfs shared pages properly (and it won't be an issue with
fork() but with after-the-fact sharing). Right now I'm more interested
in the interface for the invalidates, things like mm vs vma, the
places where we hook under the pte spinlock, things like that; then
the patch can hopefully be merged and extended with more methods like
->change_protection_page/range and added to cow etc...
Re: [PATCH] mmu notifiers #v2
Andrea Arcangeli wrote:
> This patch is the last version of a basic implementation of the mmu
> notifiers.
>
> In short, when the linux VM decides to free a page, it will unmap it
> from the linux pagetables. However when a page is mapped not just by
> the regular linux ptes, but also from the shadow pagetables, it's
> currently unfreeable by the linux VM.
>
> This patch allows the shadow pagetables to be dropped and the page to
> be freed after that, if the linux VM decides to unmap the page from
> the main ptes because it wants to swap out the page.
> [...]
> Comments welcome... especially from SGI/IBM/Quadrics and all other
> potential users of this functionality.

For HPC, this should be very interesting. Managing the registration
cache of high-speed networks from user-space is a huge mess. This
approach should help a lot.

In fact, back in 2004, I implemented something similar called vmaspy
to update the regcache of Myrinet drivers. I never submitted any patch
because Infiniband would have been the only user in the mainline
kernel and they were reluctant to these ideas [1]. In the meantime,
some of them apparently changed their mind, since they implemented a
vmops-overriding hack to do something similar [2]. This patch should
simplify all this.

One of the differences with my patch is that you attach the notifier
list to the mm_struct while my code attached it to vmas. But I now
don't think it was such a good idea, since it probably didn't reduce
the number of notifier calls a lot.

Also, one thing that I looked at in vmaspy was notifying fork. I am
not sure what happens on copy-on-write with your code, but for sure
C-o-w is problematic for shadow page tables. I thought shadow pages
should just be invalidated when a fork happens and the caller would
refill them after forcing C-o-w or so. So adding a notifier call there
too might be nice.

Brice

[1] http://lkml.org/lkml/2005/4/29/175
[2] http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf
Re: [PATCH] mmu notifiers #v2
On Wed, Jan 16, 2008 at 07:18:53AM +1100, Benjamin Herrenschmidt wrote:
> Do you have cases where it's -not- called with the PTE lock held ?

For invalidate_page, no, because currently it's only called next to
the ptep_get_and_clear that modifies the pte and requires the pte
lock. invalidate_range/release are called w/o the pte lock held.
Re: [PATCH] mmu notifiers #v2
On Tue, 2008-01-15 at 13:44 +0100, Andrea Arcangeli wrote: > On Mon, Jan 14, 2008 at 12:02:42PM -0800, Christoph Lameter wrote: > > Hmmm... In most of the callsites we hold a writelock on mmap_sem right? > > Not in all, like Marcelo pointed out in kvm-devel, so the lowlevel > locking can't rely on the VM locks. > > About your request to schedule in the mmu notifier methods this is not > feasible right now, the notifier is often called with the pte > spinlocks held. I wonder if you can simply post/queue an event like a > softirq/pdflush. Do you have cases where it's -not- called with the PTE lock held ? Ben.
Re: [PATCH] mmu notifiers #v2
On Mon, Jan 14, 2008 at 12:02:42PM -0800, Christoph Lameter wrote: > Hmmm... In most of the callsites we hold a writelock on mmap_sem right? Not in all, like Marcelo pointed out in kvm-devel, so the lowlevel locking can't rely on the VM locks. About your request to schedule in the mmu notifier methods this is not feasible right now, the notifier is often called with the pte spinlocks held. I wonder if you can simply post/queue an event like a softirq/pdflush. > Passing mm is fine as long as mmap_sem is held. mmap_sem is not held, but don't worry "mm" can't go away under the mmu notifier, so it's ok. It's just that the KVM methods never use "mm" at all (container_of translates the struct mmu_notifier to a struct kvm, and there is the mm in kvm->mm too). Perhaps others don't save the "mm" in their container where the mmu_notifier is embedded into, so I left mm as a parameter to the methods. > Hmmm... this is ptep_clear_flush? What about the other uses of > flush_tlb_page in asm-generic/pgtable.h and related uses in arch code? This is not necessarily a 1:1 relationship with the tlb flushes. Otherwise they'd be the tlb-notifiers, not the mmu-notifiers. The other methods in pgtable.h are not dropping a user page from the "mm". That's the invalidate case right now. Other methods will not call into invalidate_page, but you're welcome to add other methods and call them from other ptep_* functions if you're interested in being notified about more than just the invalidates of the "mm". Is invalidate_page/range a clear enough method name to explain when the ptes and tlb entries have been dropped for such page/range mapped in userland at that address/range? > (would help if your patches would mention the function name in the diff > headers) my patches use git diff defaults I guess, and they mention the function name in all other places; it's just that git isn't smart enough there to catch the function name in that single place, it's ok.
> > +#define mmu_notifier(function, mm, args...) > > \ > > + do {\ > > + struct mmu_notifier *__mn; \ > > + struct hlist_node *__n; \ > > + \ > > + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \ > > + if (__mn->ops->function)\ > > + __mn->ops->function(__mn, mm, args);\ > > + } while (0) > > Does this have to be inline? ptep_clear_flush will become quite big Inline makes the patch smaller and it avoids a call in the common case that the mmu_notifier list is empty. Perhaps I could add a: if (unlikely(!list_empty(&(mm)->mmu_notifier))) { ... } so gcc could offload the internal block into a cold-icache region of .text. I think at least an unlikely(!list_empty(&(mm)->mmu_notifier)) check has to be inline. Currently there isn't such a check because I'm unsure if it really makes sense. The idea is that if you really care to optimize this you'll use self-modifying code to turn a nop into a call when a certain method is armed. That's an extreme optimization though; current code shouldn't be measurable already when disarmed.
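The optimization Andrea sketches — only an unlikely empty-list check inline, with the chain walk out of line — can be modeled in userspace like this. A plain singly linked list stands in for the kernel hlist, and all names are illustrative, not the actual patch:

```c
#include <assert.h>
#include <stddef.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

struct mmu_notifier {
    struct mmu_notifier *next;
    void (*invalidate_page)(unsigned long address);
};

/* Stand-in for the mm with its notifier chain; NULL means disarmed. */
struct mm_demo {
    struct mmu_notifier *notifiers;
};

/* Out-of-line slow path: only reached when a notifier is registered,
 * so it can live in a cold-icache region of .text. */
static void __mmu_notifier_invalidate_page(struct mm_demo *mm,
                                           unsigned long address)
{
    struct mmu_notifier *mn;

    for (mn = mm->notifiers; mn; mn = mn->next)
        if (mn->invalidate_page)
            mn->invalidate_page(address);
}

/* The only part that would be inlined into ptep_clear_flush: the
 * disarmed common case costs a single predicted-not-taken test. */
static inline void mmu_notifier_invalidate_page(struct mm_demo *mm,
                                                unsigned long address)
{
    if (unlikely(mm->notifiers != NULL))
        __mmu_notifier_invalidate_page(mm, address);
}

static int invalidations;

static void count_invalidate(unsigned long address)
{
    (void)address;
    invalidations++;
}
```

This keeps the inlined footprint of ptep_clear_flush to one compare-and-branch when no notifier is armed, which is the concern Christoph raises above.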
Re: [PATCH] mmu notifiers #v2
On Mon, 2008-01-14 at 12:02 -0800, Christoph Lameter wrote: > On Sun, 13 Jan 2008, Andrea Arcangeli wrote: > > > About the locking perhaps I'm underestimating it, but by following the > > TLB flushing analogy, by simply clearing the shadow ptes (with kvm > > mmu_lock spinlock to avoid racing with other vcpu spte accesses of > > course) and flushing the shadow-pte after clearing the main linux pte, > > it should be enough to serialize against shadow-pte page faults that > > would call into get_user_pages. Flushing the host TLB before or after > > the shadow-ptes shouldn't matter. > > Hmmm... In most of the callsites we hold a writelock on mmap_sem right? Not in unmap_mapping_range() afaik. > > Comments welcome... especially from SGI/IBM/Quadrics and all other > > potential users of this functionality. > > > There are also certain details I'm uncertain about, like passing 'mm' > > to the lowlevel methods, my KVM usage of the invalidate_page() > > notifier for example only uses 'mm' for a BUG_ON for example: > > Passing mm is fine as long as mmap_sem is held. Passing mm is always a good idea, regardless of the mmap_sem, it can be useful for lots of other things :-) > > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h > > --- a/include/asm-generic/pgtable.h > > +++ b/include/asm-generic/pgtable.h > > @@ -86,6 +86,7 @@ do { > > \ > > pte_t __pte;\ > > __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \ > > flush_tlb_page(__vma, __address); \ > > + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \ > > __pte; \ > > }) > > #endif > > Hmmm... this is ptep_clear_flush? What about the other uses of > flush_tlb_page in asm-generic/pgtable.h and related uses in arch code? > (would help if your patches would mention the function name in the diff > headers) Note that last I looked, a lot of these were stale. Might be time to resume my spring/summer cleaning of page table accessors... > > +#define mmu_notifier(function, mm, args...) 
> > \ > > + do {\ > > + struct mmu_notifier *__mn; \ > > + struct hlist_node *__n; \ > > + \ > > + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \ > > + if (__mn->ops->function)\ > > + __mn->ops->function(__mn, mm, args);\ > > + } while (0) > > Does this have to be inline? ptep_clear_flush will become quite big
Re: [PATCH] mmu notifiers #v2
On Sun, 13 Jan 2008, Andrea Arcangeli wrote: > About the locking perhaps I'm underestimating it, but by following the > TLB flushing analogy, by simply clearing the shadow ptes (with kvm > mmu_lock spinlock to avoid racing with other vcpu spte accesses of > course) and flushing the shadow-pte after clearing the main linux pte, > it should be enough to serialize against shadow-pte page faults that > would call into get_user_pages. Flushing the host TLB before or after > the shadow-ptes shouldn't matter. Hmmm... In most of the callsites we hold a writelock on mmap_sem right? > Comments welcome... especially from SGI/IBM/Quadrics and all other > potential users of this functionality. > There are also certain details I'm uncertain about, like passing 'mm' > to the lowlevel methods, my KVM usage of the invalidate_page() > notifier for example only uses 'mm' for a BUG_ON for example: Passing mm is fine as long as mmap_sem is held. > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h > --- a/include/asm-generic/pgtable.h > +++ b/include/asm-generic/pgtable.h > @@ -86,6 +86,7 @@ do { > \ > pte_t __pte;\ > __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \ > flush_tlb_page(__vma, __address); \ > + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \ > __pte; \ > }) > #endif Hmmm... this is ptep_clear_flush? What about the other uses of flush_tlb_page in asm-generic/pgtable.h and related uses in arch code? (would help if your patches would mention the function name in the diff headers) > +#define mmu_notifier(function, mm, args...) \ > + do {\ > + struct mmu_notifier *__mn; \ > + struct hlist_node *__n; \ > + \ > + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \ > + if (__mn->ops->function)\ > + __mn->ops->function(__mn, mm, args);\ > + } while (0) Does this have to be inline? 
ptep_clear_flush will become quite big
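The ordering the quoted diff establishes in ptep_clear_flush — clear the pte, flush the TLB for that address, then fire the notifier, returning the old pte — can be modeled in userspace. This is a sketch with illustrative names and an event log, not the kernel macro itself:

```c
#include <assert.h>
#include <string.h>

/* Records the order in which the three steps fire. */
static char event_log[64];

static void log_event(const char *ev)
{
    strcat(event_log, ev);
}

typedef unsigned long pte_t;

/* Stand-in for ptep_get_and_clear(): atomically fetch and zero. */
static pte_t ptep_get_and_clear(pte_t *ptep)
{
    pte_t old = *ptep;
    *ptep = 0;
    log_event("clear,");
    return old;
}

/* Stand-in for flush_tlb_page(). */
static void flush_tlb_page(unsigned long address)
{
    (void)address;
    log_event("flush,");
}

/* Stand-in for the mmu_notifier(invalidate_page, ...) dispatch. */
static void mmu_notifier_invalidate_page(unsigned long address)
{
    (void)address;
    log_event("notify,");
}

/* Models the patched ptep_clear_flush(): notify only after the pte is
 * gone and the TLB is flushed, so the secondary MMU never sees a
 * mapping the primary MMU has already dropped. */
static pte_t demo_ptep_clear_flush(pte_t *ptep, unsigned long address)
{
    pte_t old = ptep_get_and_clear(ptep);
    flush_tlb_page(address);
    mmu_notifier_invalidate_page(address);
    return old;
}
```

The notify-last ordering is what lets KVM drop its shadow pte knowing the Linux pte and TLB entry are already gone.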
Re: [PATCH] mmu notifiers #v2
On Sun, 2008-01-13 at 17:24 +0100, Andrea Arcangeli wrote: > Hello, > > This patch is the last version of a basic implementation of the mmu > notifiers. > > In short when the linux VM decides to free a page, it will unmap it > from the linux pagetables. However when a page is mapped not just by > the regular linux ptes, but also from the shadow pagetables, it's > currently unfreeable by the linux VM. > > This patch allows the shadow pagetables to be dropped and the page to > be freed after that, if the linux VM decides to unmap the page from > the main ptes because it wants to swap out the page. Another potential user of that I can see is the DRM. Nowadays, graphic cards essentially have an MMU on chip, and can do paging. It would be nice to be able to map user objects in them without having to lock them down, using your callback to properly mark them cast out on the card. Ben.