> -----Original Message-----
> From: George Dunlap [mailto:george.dun...@citrix.com]
> Sent: Wednesday, March 9, 2016 7:25 PM
> To: Wu, Feng <feng...@intel.com>; Jan Beulich <jbeul...@suse.com>; George
> Dunlap <george.dun...@eu.citrix.com>
> Cc: Andrew Cooper <andrew.coop...@citrix.com>; Dario Faggioli
> <dario.faggi...@citrix.com>; Tian, Kevin <kevin.t...@intel.com>;
> xen-de...@lists.xen.org; Konrad Rzeszutek Wilk <konrad.w...@oracle.com>;
> Keir Fraser <k...@xen.org>
> Subject: Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt
> core logic handling
>
> On 09/03/16 05:22, Wu, Feng wrote:
> >
> >> -----Original Message-----
> >> From: George Dunlap [mailto:george.dun...@citrix.com]
> >> Sent: Wednesday, March 9, 2016 1:06 AM
> >> To: Jan Beulich <jbeul...@suse.com>; George Dunlap
> >> <george.dun...@eu.citrix.com>; Wu, Feng <feng...@intel.com>
> >> Cc: Andrew Cooper <andrew.coop...@citrix.com>; Dario Faggioli
> >> <dario.faggi...@citrix.com>; Tian, Kevin <kevin.t...@intel.com>;
> >> xen-de...@lists.xen.org; Konrad Rzeszutek Wilk <konrad.w...@oracle.com>;
> >> Keir Fraser <k...@xen.org>
> >> Subject: Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d
> >> posted-interrupt core logic handling
> >>
> >> On 08/03/16 15:42, Jan Beulich wrote:
> >>>>>> On 08.03.16 at 15:42, <george.dun...@eu.citrix.com> wrote:
> >>>> On Tue, Mar 8, 2016 at 1:10 PM, Wu, Feng <feng...@intel.com> wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: George Dunlap [mailto:george.dun...@citrix.com]
> >>>> [snip]
> >>>>>> It seems like there are a couple of ways we could approach this:
> >>>>>>
> >>>>>> 1. Try to optimize the reverse look-up code so that it's not a linear
> >>>>>> linked list (getting rid of the theoretical fear)
> >>>>>
> >>>>> Good point.
> >>>>>
> >>>>>> 2. Try to test engineered situations where we expect this to be a
> >>>>>> problem, to see how big of a problem it is (proving the theory to be
> >>>>>> accurate or inaccurate in this case)
> >>>>>
> >>>>> Maybe we can run an SMP guest with all the vcpus pinned to a dedicated
> >>>>> pCPU, then run some benchmark in the guest with VT-d PI and without
> >>>>> VT-d PI, and see the performance difference between these two
> >>>>> scenarios.
> >>>>
> >>>> This would give us an idea what the worst-case scenario would be.
> >>>
> >>> How would a single VM ever give us an idea about the worst
> >>> case? Something getting close to worst case is a ton of single
> >>> vCPU guests all temporarily pinned to one and the same pCPU
> >>> (could be multi-vCPU ones, but the more vCPU-s the more
> >>> artificial this pinning would become) right before they go into
> >>> blocked state (i.e. through one of the two callers of
> >>> arch_vcpu_block()), the pinning removed while blocked, and
> >>> then all getting woken at once.
> >>
> >> Why would removing the pinning be important?
> >>
> >> And I guess it's actually the case that it doesn't need all VMs to
> >> actually be *receiving* interrupts; it just requires them to be
> >> *capable* of receiving interrupts, for there to be a long chain all
> >> blocked on the same physical cpu.
> >>
> >>>> But pinning all vcpus to a single pcpu isn't really a sensible use
> >>>> case we want to support -- if you have to do something stupid to get
> >>>> a performance regression, then as far as I'm concerned it's not a
> >>>> problem.
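For context, the loop being debated lives in pi_wakeup_interrupt(). A
simplified sketch of its shape (approximate structure and names taken
from the patch series, not a verbatim copy of the code):

    /*
     * Sketch: the wakeup notification vector alone does not identify
     * which vcpu a device interrupt was posted for, so the handler
     * has to scan every vcpu currently blocked on this pCPU.  Cost is
     * therefore linear in the length of the per-pCPU blocking list.
     */
    static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
    {
        unsigned int cpu = smp_processor_id();
        struct arch_vmx_struct *vmx, *tmp;

        ack_APIC_irq();

        spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));

        list_for_each_entry_safe(vmx, tmp,
                                 &per_cpu(pi_blocked_vcpu, cpu),
                                 pi_blocked_vcpu_list)
        {
            /* Wake only vcpus with an outstanding-notification bit set. */
            if ( pi_test_on(&vmx->pi_desc) )
            {
                list_del(&vmx->pi_blocked_vcpu_list);
                vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
            }
        }

        spin_unlock(&per_cpu(pi_blocked_vcpu_lock, cpu));
    }

The handler's cost grows with the number of vcpus blocked on that pCPU,
which is exactly the latency concern discussed below.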
> >>>>
> >>>> Or to put it a different way: If we pin 10 vcpus to a single pcpu and
> >>>> then pound them all with posted interrupts, and there is *no*
> >>>> significant performance regression, then that will conclusively prove
> >>>> that the theoretical performance regression is of no concern, and we
> >>>> can enable PI by default.
> >>>
> >>> The point isn't the pinning. The point is what pCPU they're on when
> >>> going to sleep. And that could involve quite a few more than just
> >>> 10 vCPU-s, provided they all sleep long enough.
> >>>
> >>> And the "theoretical performance regression is of no concern" is
> >>> also not a proper way of looking at it, I would say: Even if such
> >>> a situation would happen extremely rarely, if it can happen at all,
> >>> it would still be a security issue.
> >>
> >> What I'm trying to get at is -- exactly what situation? What actually
> >> constitutes a problematic interrupt latency / interrupt processing
> >> workload, how many vcpus must be sleeping on the same pcpu to actually
> >> risk triggering that latency / workload, and how feasible is it that
> >> such a situation would arise in a reasonable scenario?
> >>
> >> If 200us is too long, and it only takes 3 sleeping vcpus to get there,
> >> then yes, there is a genuine problem we need to try to address before
> >> we turn it on by default. If we say that up to 500us is tolerable, and
> >> it takes 100 sleeping vcpus to reach that latency, then this is
> >> something I don't really think we need to worry about.
> >>
> >> "I think something bad may happen" is really difficult to work with.
> >> "I want to make sure that even a high number of blocked vcpus won't
> >> cause the interrupt latency to exceed 500us; and I want it to be
> >> basically impossible for the interrupt latency to exceed 5ms under any
> >> circumstances" is a concrete target someone can either demonstrate
> >> that they meet, or aim for when trying to improve the situation.
> >>
> >> Feng: It should be pretty easy for you to:
> >
> > George, thanks a lot for pointing out a possible way to move forward.
> >
> >> * Implement a modified version of Xen where
> >>   - *All* vcpus get put on the waitqueue
> >
> > So this means all the vcpus are blocked, and hence waiting in the
> > blocking list, right?
>
> No.
>
> For testing purposes, we need a lot of vcpus on the list, but we only
> need one vcpu to actually be woken up, to see how long it takes to
> traverse the list.
>
> At the moment, a vcpu will only be put on the list if it has the
> arch_block callback defined; and it will have the arch_block callback
> defined only if the domain it's a part of has a device assigned to it.
> But it would be easy enough to make it so that *all* VMs have the
> arch_block callback defined; then all vcpus would end up on the
> pi_blocked list when they're blocked, even if they don't have a device
> assigned.
>
> That way you could have a really long pi_blocked list while only
> needing a single device to pass through to the guest.
>
> >>   - Measure how long it took to run the loop in pi_wakeup_interrupt
> >> * Have one VM receiving posted interrupts on a regular basis.
> >> * Slowly increase the number of vcpus blocked on a single cpu (e.g.,
> >>   by creating more guests), stopping when you either reach 500us or
> >>   500 vcpus. :-)
> >
> > This may depend on the environment. I was using a 10G NIC to do the
> > test; if we increase the number of guests, I need more NICs to be
> > assigned to the guests. I will see if I can get them.
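One straightforward way to take the measurement George asks for is to
wrap the existing list walk in TSC reads and remember the worst case.
A sketch, where pi_loop_worst and the entry counter are new variables
introduced purely for the test (not part of the series):

    /* New per-cpu worst-case tracker, e.g. near the top of vmx.c. */
    static DEFINE_PER_CPU(uint64_t, pi_loop_worst);

    /* Inside pi_wakeup_interrupt(), bracketing the existing walk: */
    uint64_t t0 = rdtsc();
    unsigned int entries = 0;

    /* ... existing list_for_each_entry_safe() walk, with entries++
     * added inside the loop body ... */

    uint64_t delta = rdtsc() - t0;
    if ( delta > this_cpu(pi_loop_worst) )
    {
        this_cpu(pi_loop_worst) = delta;
        printk(XENLOG_INFO
               "pi_wakeup_interrupt: %u blocked vcpus, %"PRIu64" cycles\n",
               entries, delta);
    }

Converting the cycle count to time via the TSC frequency then gives a
direct list-length-versus-latency curve to compare against the 500us
target above.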
>
> ...which is why I suggested setting the arch_block() callback for all
> domains, even those which don't have devices assigned, so that you
> could get away with a single passed-through device. :-)
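In code terms, that test hack might be as small as the following; a
sketch only, where the call site (domain initialisation) and the guard
condition are assumptions, and vmx_pi_hooks_assign() is the routine the
series uses to install the PI block/resume callbacks on device
assignment:

    /* Testing hack (sketch): somewhere in VMX domain creation, e.g.
     * vmx_domain_initialise(), install the PI hooks unconditionally
     * so every blocked vcpu lands on the per-pCPU blocking list. */
    if ( iommu_intpost )              /* VT-d posted interrupts enabled */
        vmx_pi_hooks_assign(d);       /* normally only called when the
                                       * first device is assigned */

With that in place, every blocked vcpu of every guest joins the list,
and a single passed-through NIC is enough to generate the wakeups being
timed.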
Oh, I see your point now. Thanks a lot for the suggestion! I will try to
get the data soon. :)

Thanks,
Feng

>
> -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel