On Fri, 23 Jun 2017 11:11:10 +0200, Sahid Orentino Ferdjaoui <sferd...@redhat.com> wrote:
> On Wed, Jun 21, 2017 at 12:47:27PM +0200, Henning Schild wrote:
> > On Tue, 20 Jun 2017 10:04:30 -0400, Luiz Capitulino <lcapitul...@redhat.com> wrote:
> > > On Tue, 20 Jun 2017 09:48:23 +0200, Henning Schild <henning.sch...@siemens.com> wrote:

> > > > Hi,

> > > > We are using OpenStack for managing realtime guests. We modified it and contributed to discussions on how to model the realtime feature. More recent versions of OpenStack have support for realtime, and there are a few proposals on how to improve that further.

> > > > But there is still no full answer on how to distribute threads across host-cores. The vcpus are easy, but for the emulation and io-threads there are multiple options. I would like to collect the constraints from a qemu/kvm perspective first, and then possibly influence the OpenStack development.

> > > > I will put the summary/questions first; the text below provides more context on where the questions come from.

> > > > - How do you distribute your threads when reaching the really low cyclictest results in the guests? In [3] Rik talked about problems like lock holder preemption, starvation etc., but not where/how to schedule emulators and io.

> > > We put emulator threads and io-threads on housekeeping cores in the host. I think housekeeping cores is what you're calling best-effort cores; those are non-isolated cores that will run host load.

> > As expected, any best-effort/housekeeping core will do, but overlap with the vcpu-cores is a bad idea.

> > > > - Is it ok to put a vcpu and emulator thread on the same core as long as the guest knows about it? Any funny behaving guest, not just Linux.

> > > We can't do this for KVM-RT because we run all vcpu threads with FIFO priority.

> > Same point as above, meaning the "hw:cpu_realtime_mask" approach is wrong for realtime.

> > > However, we have another project with DPDK whose goal is to achieve zero-loss networking. The configuration required by this project is very similar to the one required by KVM-RT. One difference though is that we don't use RT and hence don't use FIFO priority.

> > > In this project we've been running with the emulator thread and a vcpu sharing the same core. As long as the guest housekeeping CPUs are idle, we don't get any packet drops (most of the time, what causes packet drops in this test-case would cause spikes in cyclictest). However, we're seeing some packet drops for certain guest workloads which we are still debugging.

> > Ok, but that seems to be a different scenario where hw:cpu_policy=dedicated should be sufficient. However, if the placement of the io and emulators has to be on a subset of the dedicated cpus, something like hw:cpu_realtime_mask would be required.

> > > > - Is it ok to make the emulators potentially slow by running them on busy best-effort cores, or will they quickly be on the critical path if you do more than just cyclictest? Our experience says we don't need them reactive, even with rt-networking involved.

> > > I believe it is ok.

> > Ok.
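For reference, the Mitaka-style real-time flavor discussed above looks roughly like this (a sketch; the flavor name "rt-vm" is made up, the hw:cpu_realtime* keys are the ones from [1]):

    openstack flavor set rt-vm \
      --property hw:cpu_policy=dedicated \
      --property hw:cpu_realtime=yes \
      --property hw:cpu_realtime_mask=^0

With that mask, vcpu0 is the non-realtime vcpu, and the emulator/io threads end up running next to it - which is exactly the kind of sharing questioned below.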
> > > > Our goal is to reach a high packing density of realtime VMs. Our pragmatic first choice was to run all non-vcpu-threads on a shared set of pcpus where we also run best-effort VMs and host load. Now the OpenStack guys are not too happy with that, because that is load outside the assigned resources, which leads to quota and accounting problems.

> > > > So the current OpenStack model is to run those threads next to one or more vcpu-threads. [1] You will need to remember that the vcpus in question should not be your rt-cpus in the guest. I.e. if vcpu0 shares its pcpu with the hypervisor noise, your preemptrt-guest would use isolcpus=1.

> > > > Is that kind of sharing a pcpu really a good idea? I could imagine things like smp housekeeping (cache invalidation etc.) eventually causing vcpu1 to wait for the emulator stuck in IO.

> > > Agreed. IIRC, in the beginning of KVM-RT we saw a problem where running vcpu0 on a non-isolated core and without FIFO priority caused spikes in vcpu1. I guess we debugged this down to vcpu1 waiting a few dozen microseconds for vcpu0 for some reason. Running vcpu0 on an isolated core with FIFO priority fixed this (again, this was years ago, I don't remember all the details).

> > > > Or maybe a busy polling vcpu0 starving its own emulator, causing high latency or even deadlocks.

> > > This will probably happen if you run vcpu0 with FIFO priority.

> > Two more points that indicate that hw:cpu_realtime_mask (putting emulators/io next to any vcpu) does not work for general rt.

> > > > Even if it happens to work for Linux guests, it seems like a strong assumption that an rt-guest that has noise cores can deal with even more noise one scheduling level below.

> > > > More recent proposals [2] suggest a scheme where the emulator and io threads are on a separate core. That sounds more reasonable/conservative but dramatically increases the per-VM cost. And the pcpus hosting the hypervisor threads will probably be idle most of the time.

> > > I don't know how to solve this problem. Maybe dedicating only one core to all emulator threads and io-threads of a VM would mitigate this? Of course we'd have to test it to see if this doesn't give spikes.

> > [2] suggests exactly that, but it is a waste of pcpus. Say a vcpu needs 1.0 cores and all other threads need 0.05 cores. The real need of a 1-core rt-vm would be 1.05; for two it would be 2.05. With [1] we pack 2.05 onto 2 pcpus, and that does not work. With [2] we need 3 and waste 0.95.

> > > > I guess in this context the most important question is whether qemu is ever involved in "regular operation" if you avoid the obvious IO problems on your critical path.

> > > > My guess is that just [1] has serious hidden latency problems and [2] is taking it a step too far by wasting whole cores on idle emulators. We would like to suggest some other way in between, one that is a little easier on the core count. Our current solution seems to work fine but has the mentioned quota problems.

> > > What is your solution?

> > We have a kilo-based prototype that introduced emulator_pin_set in nova.conf. All vcpu threads will be scheduled on vcpu_pin_set, and emulators and IO of all VMs will share emulator_pin_set. vcpu_pin_set contains the isolcpus from the host and emulator_pin_set contains best-effort cores from the host. That basically means you put all emulators and io of all VMs onto a set of cores that the host potentially also uses for other stuff. Sticking with the made-up numbers from above, all the 0.05s can share pcpus.
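For reference, on a 4-core host with cores 1-3 isolated, the prototype configuration looks roughly like this (a sketch; vcpu_pin_set is the existing nova option, emulator_pin_set exists only in our downstream prototype, and the exact section placement is an assumption):

    # /etc/nova/nova.conf (kilo-based prototype)
    [DEFAULT]
    # pcpus reserved for vcpu threads, matching the host's isolcpus
    vcpu_pin_set = 1,2,3
    # best-effort pcpus shared by the emulator and io threads of all VMs
    emulator_pin_set = 0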
> > With the current implementation in mitaka (hw:cpu_realtime_mask) you can not have a single-core rt-vm, because you can not put 1.05 into 1 without overcommitting. You can put 2.05 into 2, but as you confirmed, the overcommitted core could still slow down the truly exclusive one. On a 4-core host you get a maximum of 1 rt-VM (2-3 cores).

> > With [2], which is not implemented yet, the overcommitting is avoided. But now you waste a lot of pcpus: 1.05 = 2, 2.05 = 3. On a 4-core host you get a maximum of 1 rt-VM (1-2 cores).

> > With our approach it might be hard to account for the emulator and io-threads because they share pcpus. But you do not run into overcommitting and you don't waste pcpus at the same time. On a 4-core host you get a maximum of 3 rt-VMs (1 core each), or 1 rt-VM (2-3 cores).

> I think your solution is good.

> In Linux RT context, and as you mentioned, the non-RT vCPU can acquire some guest kernel lock, then be pre-empted by the emulator thread while holding this lock. This situation blocks the RT vCPUs from doing their work. So that is why we have implemented [2]. For DPDK I don't think we have such problems because it's running in userland.

> So for DPDK context I think we could have a mask like we have for RT, and basically consider vCPU0 to handle best-effort work (emulator threads, SSH...). I think it's the current pattern used by DPDK users.

DPDK is just a library, and one can imagine an application that has cross-core communication/synchronisation needs where the emulator slowing down vcpu0 will also slow down vcpu1. Your DPDK application would have to know which of its cores did not get a full pcpu. I am not sure what the DPDK example is doing in this discussion - would that not just be cpu_policy=dedicated? I guess the normal behaviour of dedicated is that emulators and io happily share pCPUs with vCPUs, and you are looking for a way to restrict emulators/io to a subset of pCPUs because you can live with some of them not being 100%.

> For RT we have to isolate the emulator threads to an additional pCPU per guest or, as you are suggesting, to a set of pCPUs for all the guests running.

> I think we should introduce a new option:

> - hw:cpu_emulator_threads_mask=^1

> If in 'nova.conf', that mask will be applied to the set of all host CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs running there (useful for RT context).

That would allow modelling exactly what we need. In nova.conf we are talking absolute known values, there is no need for a mask, and a set is much easier to read. Also, using the same name does not sound like a good idea. And the name vcpu_pin_set clearly suggests what kind of load runs there; if using a mask it should be called pin_set.

> If on flavor extra-specs, it will be applied to the vCPUs dedicated to the guest (useful for DPDK context).

And if both are present, the flavor wins and nova.conf is ignored?
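Just to make the flavor variant concrete, as I read the proposal it would look something like this (hypothetical - the option is not implemented anywhere yet, the flavor name is made up, and the mask semantics are my reading of your mail):

    # 2-vCPU DPDK-style guest: emulator threads restricted to vCPU0's
    # pCPU, vCPU1 stays exclusive (reading the mask as "which vCPUs'
    # pCPUs the emulator threads may share")
    openstack flavor set dpdk-vm \
      --property hw:cpu_policy=dedicated \
      --property hw:cpu_emulator_threads_mask=^1

For the nova.conf side I would still prefer an explicit set, as in the emulator_pin_set example above.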
Or maybe collect some information > > > > that could be added to the current blueprints as > > > > reasoning/documentation. > > > > > > > > Sorry if you receive this mail a second time, i was not > > > > subscribed to openstack-dev the first time. > > > > > > > > best regards, > > > > Henning > > > > > > > > [1] > > > > https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html > > > > [2] > > > > https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html > > > > [3] > > > > http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf > > > > > > > __________________________________________________________________________ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev