Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
On 12/19/2011 12:21 PM, Takuya Yoshikawa wrote:
> (2011/12/19 19:03), Avi Kivity wrote:
>>> IMO, the goal should be restricted to emergencies.
>>>
>>> So a possible solution may be:
>>>   - we set the tuning parameters as conservatively as possible
>>>   - pick up a guest with a relatively high ratio
>>>     (I have to think more about how to achieve this)
>>>   - move the vm_list head for fairness
>>>
>>> In an emergency, we should not mind the performance penalty so much.
>>
>> But is the shrinker really only called in emergencies?
>
> No, sadly.
>
> That is the problem.

It's not a problem - otherwise the icache and dcache would grow
indefinitely.  kvm's caches have another limit - perhaps we should
remove it and let the shrinker manage the caches itself (but that
requires a better selection algorithm).

>> Also, with things like cgroups, we may have an emergency in one
>> container, but not in others - if the shrinker is not cgroup aware,
>> it soon will be.
>
> That seems to be a common problem for everyone, not KVM only.

It's really a requirement for dcache/icache, otherwise one container
could force out icache/dcache for another.  It doesn't concern us
directly, but we have to ensure that one guest cannot force out
another's mmu pages.

>>> But there is no perfect value, because how often mmu_shrink() can be
>>> called will change if the admin changes the sysctl_vfs_cache_pressure
>>> tuning parameter for dcache and icache, IIUC.
>>>
>>> And tdp and shadow paging differ much.
>>
>> We should aim for the following:
>>   - normal operation causes very few shrinks (some are okay)
>>   - high pressure mostly due to kvm results in kvm being shrunk (this
>>     is a pathological case caused by starting a guest with a huge
>>     amount of memory, mapping it all to /dev/zero (or ksm), and
>>     getting the guest to create shadow mappings for all of it)
>>   - general high pressure is shared among other caches like dcache
>>     and icache
>>
>> The cost of reestablishing an mmu page can be as high as half a
>> millisecond of cpu time, which is the reason I want to be
>> conservative.
>
> I agree with you.
>
> I feel that I should add lkml in CC next time to hear from mm
> specialists.  The shrinker has many heuristics added from a lot of
> experience; my lack of such experience means I need help.

Copying one expert.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
(2011/12/19 19:03), Avi Kivity wrote:
>>> IMO, the goal should be restricted to emergencies.
>>>
>>> So a possible solution may be:
>>>   - we set the tuning parameters as conservatively as possible
>>>   - pick up a guest with a relatively high ratio
>>>     (I have to think more about how to achieve this)
>>>   - move the vm_list head for fairness
>>>
>>> In an emergency, we should not mind the performance penalty so much.
>>
>> But is the shrinker really only called in emergencies?

No, sadly.

That is the problem.

>> Also, with things like cgroups, we may have an emergency in one
>> container, but not in others - if the shrinker is not cgroup aware,
>> it soon will be.

That seems to be a common problem for everyone, not KVM only.

>>> But there is no perfect value, because how often mmu_shrink() can be
>>> called will change if the admin changes the sysctl_vfs_cache_pressure
>>> tuning parameter for dcache and icache, IIUC.
>>>
>>> And tdp and shadow paging differ much.
>>
>> We should aim for the following:
>>   - normal operation causes very few shrinks (some are okay)
>>   - high pressure mostly due to kvm results in kvm being shrunk (this
>>     is a pathological case caused by starting a guest with a huge
>>     amount of memory, mapping it all to /dev/zero (or ksm), and
>>     getting the guest to create shadow mappings for all of it)
>>   - general high pressure is shared among other caches like dcache
>>     and icache
>>
>> The cost of reestablishing an mmu page can be as high as half a
>> millisecond of cpu time, which is the reason I want to be
>> conservative.

I agree with you.

I feel that I should add lkml in CC next time to hear from mm
specialists.  The shrinker has many heuristics added from a lot of
experience; my lack of such experience means I need help.

	Takuya
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
On 12/19/2011 11:56 AM, Takuya Yoshikawa wrote:
> (2011/12/19 18:26), Avi Kivity wrote:
>> On 12/19/2011 11:22 AM, Takuya Yoshikawa wrote:
>>>> Yes, it's very conservative.  But on the other hand the shrinker is
>>>> tuned for dcache and icache, where there are usually tons of useless
>>>> objects.  If we have to free something, I'd rather free them instead
>>>> of mmu pages, which tend to get recreated soon.
>>>
>>> OK, to satisfy the requirements, I will do:
>>>
>>>   1. find the guest with the highest (shadow pages / memory) ratio
>>
>> How do you propose to do that efficiently?  We may have hundreds of
>> guests, or even more, on one host.  Each guest access will involve
>> bouncing a few cache lines.
>
> IMO, the goal should be restricted to emergencies.
>
> So a possible solution may be:
>   - we set the tuning parameters as conservatively as possible
>   - pick up a guest with a relatively high ratio
>     (I have to think more about how to achieve this)
>   - move the vm_list head for fairness
>
> In an emergency, we should not mind the performance penalty so much.

But is the shrinker really only called in emergencies?

Also, with things like cgroups, we may have an emergency in one
container, but not in others - if the shrinker is not cgroup aware, it
soon will be.

>>>   2. just zap one page from that guest, keeping the current
>>>      conservative rate
>>>
>>> I will update the patch.
>>
>> I think the current rate is too conservative.  No idea what a good
>> one is; I don't have a feel for the relation between shrinker
>> callbacks and memory pressure.
>
> When I tried to see what the current code is doing, frankly speaking,
> I thought mmu_shrink() was not tested enough from the beginning.

It wasn't, and in fact the original code was even worse; the code we
have now is after some fixes.

> I read the shrinker code as far as possible and realized that the
> combination of (seeks=10*default, batch=128) is not reasonable; the
> high seeks means the shrinker rarely calculates a value higher than
> 128, so mmu_shrink() cannot be called in normal life.
>
> How about setting the batch a bit lower, keeping seeks as is?

Ok.

> But there is no perfect value, because how often mmu_shrink() can be
> called will change if the admin changes the sysctl_vfs_cache_pressure
> tuning parameter for dcache and icache, IIUC.
>
> And tdp and shadow paging differ much.

We should aim for the following:
  - normal operation causes very few shrinks (some are okay)
  - high pressure mostly due to kvm results in kvm being shrunk (this
    is a pathological case caused by starting a guest with a huge
    amount of memory, mapping it all to /dev/zero (or ksm), and getting
    the guest to create shadow mappings for all of it)
  - general high pressure is shared among other caches like dcache and
    icache

The cost of reestablishing an mmu page can be as high as half a
millisecond of cpu time, which is the reason I want to be conservative.
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
(2011/12/19 18:26), Avi Kivity wrote:
> On 12/19/2011 11:22 AM, Takuya Yoshikawa wrote:
>>> Yes, it's very conservative.  But on the other hand the shrinker is
>>> tuned for dcache and icache, where there are usually tons of useless
>>> objects.  If we have to free something, I'd rather free them instead
>>> of mmu pages, which tend to get recreated soon.
>>
>> OK, to satisfy the requirements, I will do:
>>
>>   1. find the guest with the highest (shadow pages / memory) ratio
>
> How do you propose to do that efficiently?  We may have hundreds of
> guests, or even more, on one host.  Each guest access will involve
> bouncing a few cache lines.

IMO, the goal should be restricted to emergencies.

So a possible solution may be:
  - we set the tuning parameters as conservatively as possible
  - pick up a guest with a relatively high ratio
    (I have to think more about how to achieve this)
  - move the vm_list head for fairness

In an emergency, we should not mind the performance penalty so much.

>>   2. just zap one page from that guest, keeping the current
>>      conservative rate
>>
>> I will update the patch.
>
> I think the current rate is too conservative.  No idea what a good
> one is; I don't have a feel for the relation between shrinker
> callbacks and memory pressure.

When I tried to see what the current code is doing, frankly speaking,
I thought mmu_shrink() was not tested enough from the beginning.

I read the shrinker code as far as possible and realized that the
combination of (seeks=10*default, batch=128) is not reasonable; the
high seeks means the shrinker rarely calculates a value higher than
128, so mmu_shrink() cannot be called in normal life.

How about setting the batch a bit lower, keeping seeks as is?

But there is no perfect value, because how often mmu_shrink() can be
called will change if the admin changes the sysctl_vfs_cache_pressure
tuning parameter for dcache and icache, IIUC.

And tdp and shadow paging differ much.

	Takuya
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
On 12/19/2011 11:22 AM, Takuya Yoshikawa wrote:
>> Yes, it's very conservative.  But on the other hand the shrinker is
>> tuned for dcache and icache, where there are usually tons of useless
>> objects.  If we have to free something, I'd rather free them instead
>> of mmu pages, which tend to get recreated soon.
>
> OK, to satisfy the requirements, I will do:
>
>   1. find the guest with the highest (shadow pages / memory) ratio

How do you propose to do that efficiently?  We may have hundreds of
guests, or even more, on one host.  Each guest access will involve
bouncing a few cache lines.

>   2. just zap one page from that guest, keeping the current
>      conservative rate
>
> I will update the patch.

I think the current rate is too conservative.  No idea what a good one
is; I don't have a feel for the relation between shrinker callbacks and
memory pressure.
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
(2011/12/19 17:43), Avi Kivity wrote:
> Well, if one guest is twice as large as the other guests, then it
> will want twice as many shadow pages.  So our goal should be to zap
> pages from the guest with the highest (shadow pages / memory) ratio.
>
>>> Can you measure whether there is a significant difference in a
>>> synthetic workload, and what that change is?  Perhaps apply
>>> {moderate, high} memory pressure load with {2, 4, 8, 16} VMs or
>>> something like that.
>>
>> I was running 4 VMs on my machine with high enough memory pressure.
>> The problem was that mmu_shrink() was not tuned to be called under
>> usual memory pressure: what I did was change the seeks and batch
>> parameters and make ept=0.
>>
>> At least, I have checked that if I make one VM do many meaningless
>> copies, letting the others keep silent, the shrinker frees shadow
>> pages intensively from that one.
>>
>> Anyway, I don't think making the shrinker call mmu_shrink() with the
>> default batch size, nr_to_scan=128, and just trying to free one
>> shadow page is good behaviour.
>
> Yes, it's very conservative.  But on the other hand the shrinker is
> tuned for dcache and icache, where there are usually tons of useless
> objects.  If we have to free something, I'd rather free them instead
> of mmu pages, which tend to get recreated soon.

OK, to satisfy the requirements, I will do:

  1. find the guest with the highest (shadow pages / memory) ratio
  2. just zap one page from that guest, keeping the current
     conservative rate

I will update the patch.

	Takuya
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
On 12/16/2011 04:58 PM, Takuya Yoshikawa wrote:
>> I'm not sure this is a meaningful test to verify this change is
>> worthwhile, because while the shrinker tries to free a shadow page
>> from one vm, the vm's position in the kvm list is changed, so over
>> time the shrinker will cycle over all VMs.
>
> The test was for checking whether mmu_shrink() would work as
> intended.  Maybe not good as a changelog, sorry.
>
> I admit that I could not find any strong reason except for protecting
> the host from intentionally induced shadowing.
>
> But for that, don't you think that freeing the same amount of shadow
> pages from good and bad VMs equally is a bad thing?
>
> My method will try to free many shadow pages from VMs with many
> shadow pages; e.g. if there is a pathological increase in shadow
> pages for one VM, that one will be intensively treated.
>
> If you can agree on this reasoning, I will update the description and
> resend.

Well, if one guest is twice as large as the other guests, then it will
want twice as many shadow pages.  So our goal should be to zap pages
from the guest with the highest (shadow pages / memory) ratio.

>> Can you measure whether there is a significant difference in a
>> synthetic workload, and what that change is?  Perhaps apply
>> {moderate, high} memory pressure load with {2, 4, 8, 16} VMs or
>> something like that.
>
> I was running 4 VMs on my machine with high enough memory pressure.
> The problem was that mmu_shrink() was not tuned to be called under
> usual memory pressure: what I did was change the seeks and batch
> parameters and make ept=0.
>
> At least, I have checked that if I make one VM do many meaningless
> copies, letting the others keep silent, the shrinker frees shadow
> pages intensively from that one.
>
> Anyway, I don't think making the shrinker call mmu_shrink() with the
> default batch size, nr_to_scan=128, and just trying to free one
> shadow page is good behaviour.

Yes, it's very conservative.  But on the other hand the shrinker is
tuned for dcache and icache, where there are usually tons of useless
objects.  If we have to free something, I'd rather free them instead of
mmu pages, which tend to get recreated soon.
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
On Fri, 16 Dec 2011 09:06:11 -0200
Marcelo Tosatti wrote:

> On Mon, Dec 12, 2011 at 07:26:47AM +0900, Takuya Yoshikawa wrote:
>> From: Takuya Yoshikawa
>>
>> Currently, mmu_shrink() tries to free a shadow page from one kvm and
>> does not use nr_to_scan correctly.
>>
>> This patch fixes this by making it try to free some shadow pages
>> from each kvm.  The number of shadow pages each kvm frees becomes
>> proportional to the number of shadow pages it is using.
>>
>> Note: an easy way to see how this code works is to do
>>   echo 3 > /proc/sys/vm/drop_caches
>> while some virtual machines are running.  Shadow pages will be
>> zapped as expected by this.
>
> I'm not sure this is a meaningful test to verify this change is
> worthwhile, because while the shrinker tries to free a shadow page
> from one vm, the vm's position in the kvm list is changed, so over
> time the shrinker will cycle over all VMs.

The test was for checking whether mmu_shrink() would work as intended.
Maybe not good as a changelog, sorry.

I admit that I could not find any strong reason except for protecting
the host from intentionally induced shadowing.

But for that, don't you think that freeing the same amount of shadow
pages from good and bad VMs equally is a bad thing?

My method will try to free many shadow pages from VMs with many shadow
pages; e.g. if there is a pathological increase in shadow pages for one
VM, that one will be intensively treated.

If you can agree on this reasoning, I will update the description and
resend.

> Can you measure whether there is a significant difference in a
> synthetic workload, and what that change is?  Perhaps apply
> {moderate, high} memory pressure load with {2, 4, 8, 16} VMs or
> something like that.

I was running 4 VMs on my machine with high enough memory pressure.
The problem was that mmu_shrink() was not tuned to be called under
usual memory pressure: what I did was change the seeks and batch
parameters and make ept=0.

At least, I have checked that if I make one VM do many meaningless
copies, letting the others keep silent, the shrinker frees shadow
pages intensively from that one.

Anyway, I don't think making the shrinker call mmu_shrink() with the
default batch size, nr_to_scan=128, and just trying to free one shadow
page is good behaviour.

	Takuya
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
On Mon, Dec 12, 2011 at 07:26:47AM +0900, Takuya Yoshikawa wrote:
> From: Takuya Yoshikawa
>
> Currently, mmu_shrink() tries to free a shadow page from one kvm and
> does not use nr_to_scan correctly.
>
> This patch fixes this by making it try to free some shadow pages from
> each kvm.  The number of shadow pages each kvm frees becomes
> proportional to the number of shadow pages it is using.
>
> Note: an easy way to see how this code works is to do
>   echo 3 > /proc/sys/vm/drop_caches
> while some virtual machines are running.  Shadow pages will be zapped
> as expected by this.

I'm not sure this is a meaningful test to verify this change is
worthwhile, because while the shrinker tries to free a shadow page from
one vm, the vm's position in the kvm list is changed, so over time the
shrinker will cycle over all VMs.

Can you measure whether there is a significant difference in a
synthetic workload, and what that change is?  Perhaps apply {moderate,
high} memory pressure load with {2, 4, 8, 16} VMs or something like
that.
[PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
From: Takuya Yoshikawa

Currently, mmu_shrink() tries to free a shadow page from one kvm and
does not use nr_to_scan correctly.

This patch fixes this by making it try to free some shadow pages from
each kvm.  The number of shadow pages each kvm frees becomes
proportional to the number of shadow pages it is using.

Note: an easy way to see how this code works is to do
  echo 3 > /proc/sys/vm/drop_caches
while some virtual machines are running.  Shadow pages will be zapped
as expected by this.

Signed-off-by: Takuya Yoshikawa
---
 arch/x86/kvm/mmu.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fcd0dd1..c6c61dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3921,7 +3921,7 @@ restart:
 static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct kvm *kvm;
-	struct kvm *kvm_freed = NULL;
+	int nr_to_zap, nr_total;
 	int nr_to_scan = sc->nr_to_scan;
 
 	if (nr_to_scan == 0)
@@ -3929,25 +3929,30 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	raw_spin_lock(&kvm_lock);
 
+	nr_total = percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+
 	list_for_each_entry(kvm, &kvm_list, list) {
 		int idx;
 		LIST_HEAD(invalid_list);
 
+		if (nr_to_scan <= 0) {
+			/* next time from this kvm */
+			list_move_tail(&kvm_list, &kvm->list);
+			break;
+		}
+
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
 
-		if (!kvm_freed && nr_to_scan > 0 &&
-		    kvm->arch.n_used_mmu_pages > 0) {
-			pre_zap_one_sp(kvm, &invalid_list);
-			kvm_freed = kvm;
-		}
-		nr_to_scan--;
+		/* proportional to how many shadow pages this kvm is using */
+		nr_to_zap = sc->nr_to_scan * kvm->arch.n_used_mmu_pages;
+		nr_to_zap /= nr_total;
+		nr_to_scan -= pre_zap_some_sp(kvm, &invalid_list, nr_to_zap);
 		kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
 		spin_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
 	}
 
-	if (kvm_freed)
-		list_move_tail(&kvm_freed->list, &kvm_list);
 	raw_spin_unlock(&kvm_lock);
-- 
1.7.5.4