Re: [libvirt] Overhead for a default cpu cg placement scheme
On Thu, Jun 18, 2015 at 12:09 PM, Daniel P. Berrange wrote:
> On Wed, Jun 17, 2015 at 10:55:35PM +0300, Andrey Korolyov wrote:
>> Sorry for the delay; 'perf numa numa-mem -p 8 -t 2 -P 384 -C 0 -M 0 -s 200 -zZq --thp 1 --no-data_rand_walk' exposes a difference of 0.96 to 1. The trick I did (and successfully forgot) before was setting the value of the cfs_quota in a machine-wide group, one level up from the individual vcpus.
>>
>> Right now, libvirt sets values from
>>
>>   <period>10</period>
>>   <quota>20</quota>
>>
>> for each vCPU thread cgroup, which is a bit wrong by my understanding, like:
>>
>>   /cgroup/cpu/machine/vmxx/vcpu0: period=10, quota=200
>>   /cgroup/cpu/machine/vmxx/vcpu1: period=10, quota=200
>>   /cgroup/cpu/machine/vmxx/vcpu2: period=10, quota=200
>>   /cgroup/cpu/machine/vmxx/vcpu3: period=10, quota=200
>>
>> In other words, the user (me) assumed he had limited the total consumption of the VM to two cores, though every thread can consume up to a single CPU, resulting in four-core consumption instead. With different guest cpu count/quota/host cpu count ratios there would be different practical limits for the same period-to-quota ratio, whereas a single total quota gives a much more predictable top consumption. I had put the same quota-to-period ratio in a VM-level directory to meet the expectation set by the config, and there one can observe the mentioned performance drop.
>>
>> With default placement there is no difference in the performance numbers, but the behavior of libvirt itself is somewhat controversial here. The documentation says this is the right behavior as well, but I think that limiting the vcpu group with a total quota is far more flexible than per-vcpu limits, which can negatively impact single-threaded processes in the guest; also, the overall consumption has to be recalculated every time the host core count or guest core count changes.
>>
>> Sorry for not mentioning the custom scheme before; if my assumption about execution flexibility is plainly wrong, I'll withdraw my concerns from above. I have been using 'my' scheme for a couple of years in production, and it has proved (for me) to be far less complex for workload balancing on a cpu-congested hypervisor than the generic one.
>
> As you say, there are two possible directions libvirt could have taken when implementing the scheduler tunables: either apply them to the VM as a whole, or apply them to the individual vCPUs. We debated this a fair bit, but in the end we took the per-vCPU approach, for two really compelling reasons. First, if users have two guests with identical configurations, but give one guest 2 vCPUs and the other 4 vCPUs, the general expectation is that the one with 4 vCPUs will have twice the performance. If we applied the CFS tuning at the VM level, then as you added vCPUs you'd get no increase in performance. Second, people wanted to be able to control the performance of the emulator threads separately from the vCPU threads. Now we also have dedicated I/O threads that can have different tuning set. This would be impossible if we were always setting things at the VM level.
>
> It would in theory be possible for us to add a further tunable to the VM config which allowed VM-level tuning, e.g. we could define something like
>
>   <vm_period>10</vm_period>
>   <vm_quota>20</vm_quota>
>
> Semantically, if <vm_quota> was set, we would then forbid use of the <quota> and <emulator_quota> configurations, as they'd be mutually exclusive. In such a case we'd avoid creating the sub-cgroups for vCPUs, emulator threads, etc.
>
> The question is whether the benefit would outweigh the extra code complexity. I appreciate you would desire this kind of setup, but I think we'd probably need more than one person requesting it to justify the work involved.

Thanks for a quite awesome explanation! I see: the thing that is obvious for Xen-era hosting (more vCPUs means more power) was not obvious to me. I agree that a smaller number of more powerful cores is always preferable to a large set of 'weak on average' cores under the approach I proposed. What is still confusing is that one should mind *three* separate things when setting a limit in the current scheme - the real or HT host core count, the VM's core count, and the quota-to-period ratio itself - to determine the upper cap on a given VM's consumption. It will be even more confusing when we talk about share ratios: for me, it is completely unclear how two VMs with a 2:1 share ratio for both vCPUs and emulator would behave. Will the emulator thread starve first under CPU congestion, or vice versa? Will many vCPU processes with shares equal to the emulator's exert enough influence inside a capped node to displace the actual available bandwidths from 2:1? Will the guest emulator
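The four-core surprise discussed above can be sketched numerically. This is an illustration only, assuming cgroup-v1 CFS bandwidth semantics (each cgroup may run quota microseconds per period, and a single thread can never exceed one core); the function name is ours, not libvirt's:

```python
def effective_cpu_cap(period_us, quota_us, n_vcpus, per_vcpu=True):
    """Upper bound, in host cores, that a VM's vCPU threads can consume
    under CFS bandwidth control (cgroup-v1 semantics, sketch only)."""
    if quota_us < 0:            # -1 means "no limit" in cpu.cfs_quota_us
        return float(n_vcpus)
    ratio = quota_us / period_us
    if per_vcpu:
        # Quota written into each vCPU sub-cgroup: every thread is capped
        # independently, but one thread can use at most one core.
        return n_vcpus * min(ratio, 1.0)
    # Quota written once into the machine-wide group: one shared budget.
    return min(ratio, float(n_vcpus))

# The scenario from the thread: quota/period = 2 was meant as "two cores
# total", but applied per vCPU it lets each of 4 threads run a full core.
print(effective_cpu_cap(100000, 200000, 4, per_vcpu=True))   # 4.0
print(effective_cpu_cap(100000, 200000, 4, per_vcpu=False))  # 2.0
```

This is why the same quota-to-period ratio caps a VM very differently depending on which level of the hierarchy it is written to.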
Re: [libvirt] Overhead for a default cpu cg placement scheme
On Wed, Jun 17, 2015 at 10:55:35PM +0300, Andrey Korolyov wrote:
> Sorry for the delay; 'perf numa numa-mem -p 8 -t 2 -P 384 -C 0 -M 0 -s 200 -zZq --thp 1 --no-data_rand_walk' exposes a difference of 0.96 to 1. The trick I did (and successfully forgot) before was setting the value of the cfs_quota in a machine-wide group, one level up from the individual vcpus.
>
> Right now, libvirt sets values from
>
>   <period>10</period>
>   <quota>20</quota>
>
> for each vCPU thread cgroup, which is a bit wrong by my understanding, like:
>
>   /cgroup/cpu/machine/vmxx/vcpu0: period=10, quota=200
>   /cgroup/cpu/machine/vmxx/vcpu1: period=10, quota=200
>   /cgroup/cpu/machine/vmxx/vcpu2: period=10, quota=200
>   /cgroup/cpu/machine/vmxx/vcpu3: period=10, quota=200
>
> In other words, the user (me) assumed he had limited the total consumption of the VM to two cores, though every thread can consume up to a single CPU, resulting in four-core consumption instead. With different guest cpu count/quota/host cpu count ratios there would be different practical limits for the same period-to-quota ratio, whereas a single total quota gives a much more predictable top consumption. I had put the same quota-to-period ratio in a VM-level directory to meet the expectation set by the config, and there one can observe the mentioned performance drop.
>
> With default placement there is no difference in the performance numbers, but the behavior of libvirt itself is somewhat controversial here. The documentation says this is the right behavior as well, but I think that limiting the vcpu group with a total quota is far more flexible than per-vcpu limits, which can negatively impact single-threaded processes in the guest; also, the overall consumption has to be recalculated every time the host core count or guest core count changes.
>
> Sorry for not mentioning the custom scheme before; if my assumption about execution flexibility is plainly wrong, I'll withdraw my concerns from above. I have been using 'my' scheme for a couple of years in production, and it has proved (for me) to be far less complex for workload balancing on a cpu-congested hypervisor than the generic one.

As you say, there are two possible directions libvirt could have taken when implementing the scheduler tunables: either apply them to the VM as a whole, or apply them to the individual vCPUs. We debated this a fair bit, but in the end we took the per-vCPU approach, for two really compelling reasons. First, if users have two guests with identical configurations, but give one guest 2 vCPUs and the other 4 vCPUs, the general expectation is that the one with 4 vCPUs will have twice the performance. If we applied the CFS tuning at the VM level, then as you added vCPUs you'd get no increase in performance. Second, people wanted to be able to control the performance of the emulator threads separately from the vCPU threads. Now we also have dedicated I/O threads that can have different tuning set. This would be impossible if we were always setting things at the VM level.

It would in theory be possible for us to add a further tunable to the VM config which allowed VM-level tuning, e.g. we could define something like

  <vm_period>10</vm_period>
  <vm_quota>20</vm_quota>

Semantically, if <vm_quota> was set, we would then forbid use of the <quota> and <emulator_quota> configurations, as they'd be mutually exclusive. In such a case we'd avoid creating the sub-cgroups for vCPUs, emulator threads, etc.

The question is whether the benefit would outweigh the extra code complexity. I appreciate you would desire this kind of setup, but I think we'd probably need more than one person requesting it to justify the work involved.
Regards,
Daniel

--
|: http://berrange.com       -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org        -o- http://virt-manager.org :|
|: http://autobuild.org      -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
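For reference, the tunables discussed in this exchange live under `<cputune>` in the libvirt domain XML. The sketch below shows the existing per-thread elements next to the hypothetical VM-level ones floated above; `<vm_period>`/`<vm_quota>` are NOT part of the libvirt schema, and the numeric values are placeholders, not recommendations:

```xml
<domain type='kvm'>
  <name>vmxx</name>
  <vcpu>4</vcpu>
  <cputune>
    <!-- Existing tunables: written into each vCPU thread's cgroup, so
         the effective cap multiplies with the number of vCPUs. -->
    <period>100000</period>
    <quota>200000</quota>
    <!-- Separate knobs for the emulator threads' cgroup. -->
    <emulator_period>100000</emulator_period>
    <emulator_quota>100000</emulator_quota>
    <!-- Hypothetical VM-level budget proposed in this thread (not
         implemented); it would be mutually exclusive with <quota>
         and <emulator_quota> above:
         <vm_period>100000</vm_period>
         <vm_quota>200000</vm_quota> -->
  </cputune>
</domain>
```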
Re: [libvirt] Overhead for a default cpu cg placement scheme
On Thu, Jun 11, 2015 at 4:30 PM, Daniel P. Berrange wrote:
> On Thu, Jun 11, 2015 at 04:24:18PM +0300, Andrey Korolyov wrote:
>> On Thu, Jun 11, 2015 at 4:13 PM, Daniel P. Berrange wrote:
>>> On Thu, Jun 11, 2015 at 04:06:59PM +0300, Andrey Korolyov wrote:
>>>> On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange wrote:
>>>>> On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote:
>>>>>> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange wrote:
>>>>>>> On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote:
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> would it be possible to adopt an optional tunable for the virCgroup mechanism that disables nested (per-thread) cgroup creation? Those bring visible overhead for many-threaded guest workloads, almost 5% in a non-congested host CPU state, primarily because the host scheduler has to make many more decisions with those cgroups than without them. We also experienced a lot of host lockups with the currently exploited cgroup placement and disabled nested behavior a couple of years ago. Though the current patch simply carves out the mentioned behavior, leaving only top-level per-machine cgroups, it could serve for upstream after some adaptation; that's why I'm asking about the chance of its acceptance. This message is a kind of 'feature request': it can either be accepted/dropped from our side, or someone may give a hand and redo it from scratch. The detailed benchmarks relate to a 3.10.y host; if anyone is interested in the numbers for the latest stable, I can update them.
>>>>>>>
>>>>>>> When you say nested cgroup creation, are you referring to the modern libvirt hierarchy, or the legacy hierarchy, as described here:
>>>>>>>
>>>>>>> http://libvirt.org/cgroups.html
>>>>>>>
>>>>>>> The current libvirt setup, used for a year or so now, is much shallower than previously, to the extent that we'd consider performance problems with it to be the job of the kernel to fix.
>>>>>>
>>>>>> Thanks, I'm referring to the 'new nested' hierarchy for the overhead mentioned above. The host crashes I mentioned happened with the old hierarchy a while back; I forgot to mention this. Despite the flattening of the topology in the current scheme, it should be possible to disable fine-grained group creation for the VM threads for users who don't need per-vcpu cpu pinning/accounting (the overhead is caused by placement in the cpu cgroup, not by the accounting/pinning ones; I'm assuming equal distribution across all nested-aware cgroup types with such disablement); that's the point for now.
>>>>>
>>>>> Ok, so the per-vCPU cgroups are used for a couple of things:
>>>>>
>>>>>  - Setting scheduler tunables - period/quota/shares/etc
>>>>>  - Setting CPU pinning
>>>>>  - Setting NUMA memory pinning
>>>>>
>>>>> In addition to the per-vCPU cgroup, we have one cgroup for each I/O thread, and also one more for general QEMU emulator threads.
>>>>>
>>>>> In the case of CPU pinning we already have automatic fallback to sched_setaffinity if the CPUSET controller isn't available.
>>>>>
>>>>> We could in theory start off without the per-vCPU/emulator/I/O cgroups and only create them as and when the feature is actually used. The concern I would have, though, is that changing the cgroups layout on the fly may cause unexpected side effects in the behaviour of the VM. More critically, there would be a lot of places in the code where we would need to deal with this, which could hurt maintainability.
>>>>>
>>>>> How confident are you that the performance problems you see are inherent to the actual use of the cgroups, and not instead the result of some particularly bad choice of default parameters we might have left in the cgroups? In general I'd have a desire to try to eliminate the perf impact before we consider the complexity of disabling this feature.
>>>>>
>>>>> Regards,
>>>>> Daniel
>>>>
>>>> Hm, what are you proposing to begin with in testing terms? By my understanding, excessive cgroup usage along with small scheduler quanta *will* lead to some overhead anyway. Let's look at the numbers, which I will bring tomorrow; the mentioned five percent was caught on a guest 'perf numa xxx' for different kinds of mappings and host behavior (post-3.8): memory automigration on/off, a kind of 'numa passthrough' (grouping vcpu threads according to the host and emulated guest NUMA topologies), and totally scattered and unpinned threads within a single and within multiple NUMA nodes.
Re: [libvirt] Overhead for a default cpu cg placement scheme
On Thu, Jun 11, 2015 at 04:24:18PM +0300, Andrey Korolyov wrote:
> On Thu, Jun 11, 2015 at 4:13 PM, Daniel P. Berrange wrote:
>> On Thu, Jun 11, 2015 at 04:06:59PM +0300, Andrey Korolyov wrote:
>>> On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange wrote:
>>>> On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote:
>>>>> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange wrote:
>>>>>> On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> would it be possible to adopt an optional tunable for the virCgroup mechanism that disables nested (per-thread) cgroup creation? Those bring visible overhead for many-threaded guest workloads, almost 5% in a non-congested host CPU state, primarily because the host scheduler has to make many more decisions with those cgroups than without them. We also experienced a lot of host lockups with the currently exploited cgroup placement and disabled nested behavior a couple of years ago. Though the current patch simply carves out the mentioned behavior, leaving only top-level per-machine cgroups, it could serve for upstream after some adaptation; that's why I'm asking about the chance of its acceptance. This message is a kind of 'feature request': it can either be accepted/dropped from our side, or someone may give a hand and redo it from scratch. The detailed benchmarks relate to a 3.10.y host; if anyone is interested in the numbers for the latest stable, I can update them.
>>>>>>
>>>>>> When you say nested cgroup creation, are you referring to the modern libvirt hierarchy, or the legacy hierarchy, as described here:
>>>>>>
>>>>>> http://libvirt.org/cgroups.html
>>>>>>
>>>>>> The current libvirt setup, used for a year or so now, is much shallower than previously, to the extent that we'd consider performance problems with it to be the job of the kernel to fix.
>>>>>
>>>>> Thanks, I'm referring to the 'new nested' hierarchy for the overhead mentioned above. The host crashes I mentioned happened with the old hierarchy a while back; I forgot to mention this. Despite the flattening of the topology in the current scheme, it should be possible to disable fine-grained group creation for the VM threads for users who don't need per-vcpu cpu pinning/accounting (the overhead is caused by placement in the cpu cgroup, not by the accounting/pinning ones; I'm assuming equal distribution across all nested-aware cgroup types with such disablement); that's the point for now.
>>>>
>>>> Ok, so the per-vCPU cgroups are used for a couple of things:
>>>>
>>>>  - Setting scheduler tunables - period/quota/shares/etc
>>>>  - Setting CPU pinning
>>>>  - Setting NUMA memory pinning
>>>>
>>>> In addition to the per-vCPU cgroup, we have one cgroup for each I/O thread, and also one more for general QEMU emulator threads.
>>>>
>>>> In the case of CPU pinning we already have automatic fallback to sched_setaffinity if the CPUSET controller isn't available.
>>>>
>>>> We could in theory start off without the per-vCPU/emulator/I/O cgroups and only create them as and when the feature is actually used. The concern I would have, though, is that changing the cgroups layout on the fly may cause unexpected side effects in the behaviour of the VM. More critically, there would be a lot of places in the code where we would need to deal with this, which could hurt maintainability.
>>>>
>>>> How confident are you that the performance problems you see are inherent to the actual use of the cgroups, and not instead the result of some particularly bad choice of default parameters we might have left in the cgroups? In general I'd have a desire to try to eliminate the perf impact before we consider the complexity of disabling this feature.
>>>>
>>>> Regards,
>>>> Daniel
>>>
>>> Hm, what are you proposing to begin with in testing terms? By my understanding, excessive cgroup usage along with small scheduler quanta *will* lead to some overhead anyway. Let's look at the numbers, which I will bring tomorrow; the mentioned five percent was caught on a guest 'perf numa xxx' for different kinds of mappings and host behavior (post-3.8): memory automigration on/off, a kind of 'numa passthrough' (grouping vcpu threads according to the host and emulated guest NUMA topologies), and totally scattered and unpinned threads within a single and within multiple NUMA nodes. As the result for 3.10.y, there was a five-percent difference between the best-performing case with thread-level cpu cgroups and a 'totally scattered' case on a simple mid-range two-headed node.
Re: [libvirt] Overhead for a default cpu cg placement scheme
On Thu, Jun 11, 2015 at 4:13 PM, Daniel P. Berrange wrote:
> On Thu, Jun 11, 2015 at 04:06:59PM +0300, Andrey Korolyov wrote:
>> On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange wrote:
>>> On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote:
>>>> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange wrote:
>>>>> On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> would it be possible to adopt an optional tunable for the virCgroup mechanism that disables nested (per-thread) cgroup creation? Those bring visible overhead for many-threaded guest workloads, almost 5% in a non-congested host CPU state, primarily because the host scheduler has to make many more decisions with those cgroups than without them. We also experienced a lot of host lockups with the currently exploited cgroup placement and disabled nested behavior a couple of years ago. Though the current patch simply carves out the mentioned behavior, leaving only top-level per-machine cgroups, it could serve for upstream after some adaptation; that's why I'm asking about the chance of its acceptance. This message is a kind of 'feature request': it can either be accepted/dropped from our side, or someone may give a hand and redo it from scratch. The detailed benchmarks relate to a 3.10.y host; if anyone is interested in the numbers for the latest stable, I can update them.
>>>>>
>>>>> When you say nested cgroup creation, are you referring to the modern libvirt hierarchy, or the legacy hierarchy, as described here:
>>>>>
>>>>> http://libvirt.org/cgroups.html
>>>>>
>>>>> The current libvirt setup, used for a year or so now, is much shallower than previously, to the extent that we'd consider performance problems with it to be the job of the kernel to fix.
>>>>
>>>> Thanks, I'm referring to the 'new nested' hierarchy for the overhead mentioned above. The host crashes I mentioned happened with the old hierarchy a while back; I forgot to mention this. Despite the flattening of the topology in the current scheme, it should be possible to disable fine-grained group creation for the VM threads for users who don't need per-vcpu cpu pinning/accounting (the overhead is caused by placement in the cpu cgroup, not by the accounting/pinning ones; I'm assuming equal distribution across all nested-aware cgroup types with such disablement); that's the point for now.
>>>
>>> Ok, so the per-vCPU cgroups are used for a couple of things:
>>>
>>>  - Setting scheduler tunables - period/quota/shares/etc
>>>  - Setting CPU pinning
>>>  - Setting NUMA memory pinning
>>>
>>> In addition to the per-vCPU cgroup, we have one cgroup for each I/O thread, and also one more for general QEMU emulator threads.
>>>
>>> In the case of CPU pinning we already have automatic fallback to sched_setaffinity if the CPUSET controller isn't available.
>>>
>>> We could in theory start off without the per-vCPU/emulator/I/O cgroups and only create them as and when the feature is actually used. The concern I would have, though, is that changing the cgroups layout on the fly may cause unexpected side effects in the behaviour of the VM. More critically, there would be a lot of places in the code where we would need to deal with this, which could hurt maintainability.
>>>
>>> How confident are you that the performance problems you see are inherent to the actual use of the cgroups, and not instead the result of some particularly bad choice of default parameters we might have left in the cgroups? In general I'd have a desire to try to eliminate the perf impact before we consider the complexity of disabling this feature.
>>>
>>> Regards,
>>> Daniel
>>
>> Hm, what are you proposing to begin with in testing terms? By my understanding, excessive cgroup usage along with small scheduler quanta *will* lead to some overhead anyway. Let's look at the numbers, which I will bring tomorrow; the mentioned five percent was caught on a guest 'perf numa xxx' for different kinds of mappings and host behavior (post-3.8): memory automigration on/off, a kind of 'numa passthrough' (grouping vcpu threads according to the host and emulated guest NUMA topologies), and totally scattered and unpinned threads within a single and within multiple NUMA nodes. As the result for 3.10.y, there was a five-percent difference between the best-performing case with thread-level cpu cgroups and a 'totally scattered' case on a simple mid-range two-headed node. If you think the choice of an emulated workload is wrong, please let me know; I was afraid that a non-synthetic workload in the guest might suffer from a range of side factors, and therefore chose perf for this task.
Re: [libvirt] Overhead for a default cpu cg placement scheme
On Thu, Jun 11, 2015 at 04:06:59PM +0300, Andrey Korolyov wrote:
> On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange wrote:
>> On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote:
>>> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange wrote:
>>>> On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> would it be possible to adopt an optional tunable for the virCgroup mechanism that disables nested (per-thread) cgroup creation? Those bring visible overhead for many-threaded guest workloads, almost 5% in a non-congested host CPU state, primarily because the host scheduler has to make many more decisions with those cgroups than without them. We also experienced a lot of host lockups with the currently exploited cgroup placement and disabled nested behavior a couple of years ago. Though the current patch simply carves out the mentioned behavior, leaving only top-level per-machine cgroups, it could serve for upstream after some adaptation; that's why I'm asking about the chance of its acceptance. This message is a kind of 'feature request': it can either be accepted/dropped from our side, or someone may give a hand and redo it from scratch. The detailed benchmarks relate to a 3.10.y host; if anyone is interested in the numbers for the latest stable, I can update them.
>>>>
>>>> When you say nested cgroup creation, are you referring to the modern libvirt hierarchy, or the legacy hierarchy, as described here:
>>>>
>>>> http://libvirt.org/cgroups.html
>>>>
>>>> The current libvirt setup, used for a year or so now, is much shallower than previously, to the extent that we'd consider performance problems with it to be the job of the kernel to fix.
>>>
>>> Thanks, I'm referring to the 'new nested' hierarchy for the overhead mentioned above. The host crashes I mentioned happened with the old hierarchy a while back; I forgot to mention this. Despite the flattening of the topology in the current scheme, it should be possible to disable fine-grained group creation for the VM threads for users who don't need per-vcpu cpu pinning/accounting (the overhead is caused by placement in the cpu cgroup, not by the accounting/pinning ones; I'm assuming equal distribution across all nested-aware cgroup types with such disablement); that's the point for now.
>>
>> Ok, so the per-vCPU cgroups are used for a couple of things:
>>
>>  - Setting scheduler tunables - period/quota/shares/etc
>>  - Setting CPU pinning
>>  - Setting NUMA memory pinning
>>
>> In addition to the per-vCPU cgroup, we have one cgroup for each I/O thread, and also one more for general QEMU emulator threads.
>>
>> In the case of CPU pinning we already have automatic fallback to sched_setaffinity if the CPUSET controller isn't available.
>>
>> We could in theory start off without the per-vCPU/emulator/I/O cgroups and only create them as and when the feature is actually used. The concern I would have, though, is that changing the cgroups layout on the fly may cause unexpected side effects in the behaviour of the VM. More critically, there would be a lot of places in the code where we would need to deal with this, which could hurt maintainability.
>>
>> How confident are you that the performance problems you see are inherent to the actual use of the cgroups, and not instead the result of some particularly bad choice of default parameters we might have left in the cgroups? In general I'd have a desire to try to eliminate the perf impact before we consider the complexity of disabling this feature.
>>
>> Regards,
>> Daniel
>
> Hm, what are you proposing to begin with in testing terms? By my understanding, excessive cgroup usage along with small scheduler quanta *will* lead to some overhead anyway. Let's look at the numbers, which I will bring tomorrow; the mentioned five percent was caught on a guest 'perf numa xxx' for different kinds of mappings and host behavior (post-3.8): memory automigration on/off, a kind of 'numa passthrough' (grouping vcpu threads according to the host and emulated guest NUMA topologies), and totally scattered and unpinned threads within a single and within multiple NUMA nodes. As the result for 3.10.y, there was a five-percent difference between the best-performing case with thread-level cpu cgroups and a 'totally scattered' case on a simple mid-range two-headed node. If you think the choice of an emulated workload is wrong, please let me know; I was afraid that a non-synthetic workload in the guest might suffer from a range of side factors, and therefore chose perf for this task.

Benchmarking isn't my area of expertise, but you should be able to just disable the CPUSET controller entirely in qemu.conf
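Daniel's pointer about disabling the CPUSET controller maps to the `cgroup_controllers` setting in libvirt's qemu.conf. A sketch of what that might look like; the list below is an assumption (the usual controller set minus "cpuset"), so check the commented-out defaults shipped in your own qemu.conf:

```
# /etc/libvirt/qemu.conf
# Leaving "cpuset" out of this list stops libvirt from using the cpuset
# cgroup controller for guests; CPU pinning then falls back to
# sched_setaffinity() as described earlier in the thread.
cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]
```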
Re: [libvirt] Overhead for a default cpu cg placement scheme
On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange wrote: > On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote: >> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange >> wrote: >> > On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote: >> >> Hi Daniel, >> >> >> >> would it possible to adopt an optional tunable for a virCgroup >> >> mechanism which targets to a disablement of a nested (per-thread) >> >> cgroup creation? Those are bringing visible overhead for many-threaded >> >> guest workloads, almost 5% in non-congested host CPU state, primarily >> >> because the host scheduler should make a much more decisions with >> >> those cgroups than without them. We also experienced a lot of host >> >> lockups with currently exploited cgroup placement and disabled nested >> >> behavior a couple of years ago. Though the current patch is simply >> >> carves out the mentioned behavior, leaving only top-level per-machine >> >> cgroups, it can serve for an upstream after some adaptation, that`s >> >> why I`m asking about a chance of its acceptance. This message is a >> >> kind of 'request of a feature', it either can be accepted/dropped from >> >> our side or someone may give a hand and redo it from scratch. The >> >> detailed benchmarks are related to a host 3.10.y, if anyone is >> >> interested in the numbers for latest stable, I can update those. >> > >> > When you say nested cgroup creation, as you referring to the modern >> > libvirt hierarchy, or the legacy hierarchy - as described here: >> > >> > http://libvirt.org/cgroups.html >> > >> > The current libvirt setup used for a year or so now is much shallower >> > than previously, to the extent that we'd consider performance problems >> > with it to be the job of the kernel to fix. >> >> Thanks, I`m referring to a 'new nested' hiearchy for an overhead >> mentioned above. The host crashes I mentioned happened with old >> hierarchy back ago, forgot to mention this. 
Despite the flattening of >> the topo for the current scheme it should be possible to disable fine >> group creation for the VM threads for some users who don`t need >> per-vcpu cpu pinning/accounting (though overhead caused by a placement >> for cpu cgroup, not by accounting/pinning ones, I`m assuming equal >> distribution with such disablement for all nested-aware cgroup types), >> that`s the point for now. > > Ok, so the per-vCPU cgroups are used for a couple of things > > - Setting scheduler tunables - period/quota/shares/etc > - Setting CPU pinning > - Setting NUMA memory pinning > > In addition to the per-VCPU cgroup, we have one cgroup fr each > I/O thread, and also one more for general QEMU emulator threads. > > In the case of CPU pinning we already have automatic fallback to > sched_setaffinity if the CPUSET controller isn't available. > > We could in theory start off without the per-vCPU/emulator/I/O > cgroups and only create them as & when the feature is actually > used. The concern I would have though is that changing the cgroups > layout on the fly may cause unexpected sideeffects in behaviour of > the VM. More critically, there would be alot of places in the code > where we would need to deal with this which could hurt maintainability. > > How confident are you that the performance problems you see are inherant > to the actual use of the cgroups, and not instead as a result of some > particular bad choice of default parameters we might have left in the > cgroups ? In general I'd have a desire to try to work to eliminate the > perf impact before we consider the complexity of disabling this feature > > Regards, > Daniel Hm, what are you proposing to begin with in a testing terms? By my understanding the excessive cgroup usage along with small scheduler quanta *will* lead to some overhead anyway. 
Let's look at the numbers I will bring tomorrow. The five percent
mentioned was caught on a guest 'perf numa xxx' run across different
kinds of mappings and host behaviors (post-3.8): memory automigration
on/off, a kind of 'NUMA passthrough' (grouping vCPU threads according
to the host and the emulated guest NUMA topologies), and totally
scattered, unpinned threads within a single NUMA node and across
multiple NUMA nodes. As the result for 3.10.y, there was a
five-percent difference between the best-performing case with
thread-level cpu cgroups and the 'totally scattered' case on a simple
mid-range dual-socket node. If you think the choice of an emulated
workload is wrong, please let me know; I was afraid that a
non-synthetic workload in the guest might suffer from a range of side
factors, and therefore chose perf for this task.
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
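The pinning variants compared above (vCPU threads grouped by host/guest NUMA topology versus left scattered) ultimately come down to affinity calls such as sched_setaffinity, the same primitive libvirt falls back to for CPU pinning when the cpuset controller is unavailable. A minimal Linux-only sketch of that primitive:

```python
import os

def pin_to_cpus(cpus):
    """Pin the calling process to the given set of host CPUs; this is
    the sched_setaffinity fallback mentioned in the thread."""
    os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)

# Pin to one CPU from the current mask (the host topology is unknown
# here, so the sketch avoids hard-coding a NUMA node's CPU list).
first = min(os.sched_getaffinity(0))
print(pin_to_cpus({first}))
```

Pinning each vCPU thread to the CPU set of one host NUMA node approximates the 'NUMA passthrough' grouping; omitting the call leaves threads in the scattered configuration.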