>>> On 08.05.17 at 18:15, <chao....@intel.com> wrote:
> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>> On 03.05.17 at 12:08, <george.dun...@citrix.com> wrote:
>>> On 02/05/17 06:45, Chao Gao wrote:
>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>> I compared the maximum of #entry in one list and #event (adding an
>>>>>> entry to the PI blocking list) with and without the three latter
>>>>>> patches. Here is the result:
>>>>>>
>>>>>> ---------------------------------------------------------
>>>>>> | Items           | Maximum of #entry | #event |
>>>>>> ---------------------------------------------------------
>>>>>> | W/ the patches  | 6                 | 22740  |
>>>>>> ---------------------------------------------------------
>>>>>> | W/O the patches | 128               | 46481  |
>>>>>> ---------------------------------------------------------
>>>>>
>>>>> Any chance you could trace how long the list traversal took? It
>>>>> would be good for future reference to have an idea what kinds of
>>>>> timescales we're talking about.
>>>>
>>>> Hi.
>>>>
>>>> I made a simple test to get the time consumed by the list traversal.
>>>> Apply the patch below and create one hvm guest with 128 vcpus and a
>>>> passthrough 40 NIC. All guest vcpus are pinned to one pcpu. Collect
>>>> data with 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode it
>>>> with xentrace_format. When the list length is about 128, the
>>>> traversal time is in the range of 1750 cycles to 39330 cycles. The
>>>> physical cpu's frequency is 1795.788MHz, so the time consumed is in
>>>> the range of 1us to 22us. If 0.5ms is the upper bound the system can
>>>> tolerate, at most 2900 vcpus can be added into the list.
>>>
>>> Great, thanks Chao Gao, that's useful.
>>
>> Looks like Chao Gao has been dropped ...
>>
>>> I'm not sure a fixed latency -- say 500us -- is the right thing to
>>> look at; if all 2900 vcpus arranged to have interrupts staggered at
>>> 500us intervals it could easily lock up the cpu for nearly a full
>>> second. But I'm having trouble formulating a good limit scenario.
>>>
>>> In any case, 22us should be safe from a security standpoint*, and 128
>>> should be pretty safe from a "make the common case fast" standpoint:
>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>> traffic will be the least of your performance problems, I should
>>> think.
>>>
>>> -George
>>>
>>> * Waiting for Jan to contradict me on this one. :-)
>>
>> 22us would certainly be fine, if this were the worst case scenario.
>> I'm not sure the value measured for 128 list entries can easily be
>> scaled to several thousands of them, due to cache and/or NUMA
>> effects. I continue to think that we primarily need theoretical
>> proof of an upper boundary on list length being enforced, rather
>> than any measurements or randomized balancing. And just to be
>> clear - if someone overloads their system, I do not see a need to
>> have a guaranteed maximum list traversal latency here. All I ask
>> for is that list traversal time scales with total vCPU count divided
>> by pCPU count.
>
> Thanks, Jan & George.
>
> I think it is now clearer to me what I should do next.
>
> In my understanding, we should distribute the wakeup interrupts like
> this:
> 1. By default, distribute it to the local pCPU ('local' means the vCPU
> is on the pCPU) to make the common case fast.
> 2. When the list grows to a point where we think it may consume too
> much time to traverse it, still distribute the wakeup interrupt to the
> local pCPU, ignoring the case where the admin intentionally overloads
> their system.
> 3. When the list length reaches the theoretical average maximum (i.e.
> maximal vCPU count divided by pCPU count), distribute the wakeup
> interrupt to another, underutilized pCPU.
>
> But I am confused: if we don't care that someone overloads their
> system, why do we need stage #3? If not, I have no idea how to meet
> Jan's request that the list traversal time scale with total vCPU count
> divided by pCPU count. Or will we reach stage #3 before stage #2?
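
[For reference, the back-of-the-envelope arithmetic behind the 22us and
~2900 figures quoted further up can be reconstructed as below. This is
only a sketch: it assumes the traversal cost stays linear in the list
length and takes the 0.5ms figure as the tolerated bound, as in the
quoted test; it is not part of any posted patch.]

/* Rough reconstruction of the quoted numbers; illustrative only. */
#include <stdio.h>

int main(void)
{
    const double worst_cycles = 39330.0;    /* measured worst case, ~128 entries */
    const double freq_mhz     = 1795.788;   /* pCPU frequency of the test machine */
    const double budget_us    = 500.0;      /* assumed 0.5ms tolerance */

    double per_list_us  = worst_cycles / freq_mhz;   /* ~21.9us per traversal */
    double per_entry_us = per_list_us / 128.0;       /* ~0.17us per list entry */

    /* If the cost stays linear, ~2900 entries fit in the 0.5ms budget. */
    printf("traversal: %.1f us, per entry: %.2f us, max entries: %.0f\n",
           per_list_us, per_entry_us, budget_us / per_entry_us);
    return 0;
}
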
The thing is that imo point 2 is too fuzzy to be of any use, i.e. 3
should take effect immediately. We don't mean to ignore any admin
decisions here; it is just that if they overload their systems, the net
effect of 3 may still not be good enough to provide smooth behavior.
But that's then a result of them overloading their systems in the first
place. IOW, you should try to evenly distribute vCPU-s as soon as their
count on a given pCPU exceeds the calculated average.

Jan
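
[To make that policy concrete, a minimal sketch follows. It is not the
actual Xen VT-d PI code: NR_PCPUS/NR_VCPUS, list_len[] and
pick_dest_pcpu() are made-up names, and main() merely simulates every
vCPU blocking on one pCPU. The idea is simply: keep the wakeup entry on
the local pCPU while its list is below the calculated average (total
vCPUs / pCPUs), otherwise move it to the currently least-loaded pCPU.]

/*
 * Illustrative model of the distribution policy discussed above,
 * not Xen code.
 */
#include <stdio.h>

#define NR_PCPUS   4
#define NR_VCPUS   32   /* assumed totals, for the example only */

static unsigned int list_len[NR_PCPUS];  /* length of each pCPU's blocking list */

static unsigned int pick_dest_pcpu(unsigned int local)
{
    unsigned int cap = (NR_VCPUS + NR_PCPUS - 1) / NR_PCPUS; /* calculated average */
    unsigned int i, best = local;

    /* Common case: stay local, so the wakeup IPI hits the right pCPU cheaply. */
    if ( list_len[local] < cap )
        return local;

    /* Otherwise spread out: pick the pCPU with the shortest list right now. */
    for ( i = 0; i < NR_PCPUS; i++ )
        if ( list_len[i] < list_len[best] )
            best = i;

    return best;
}

int main(void)
{
    unsigned int v;

    /* Block all vCPUs "on" pCPU 0 and watch the lists stay bounded. */
    for ( v = 0; v < NR_VCPUS; v++ )
        list_len[pick_dest_pcpu(0)]++;

    for ( v = 0; v < NR_PCPUS; v++ )
        printf("pcpu%u: %u entries\n", v, list_len[v]);

    return 0;
}

[With the example numbers, all four lists end up with 8 entries each,
i.e. the per-pCPU list length stays at the vCPU/pCPU average even
though every vCPU blocked on pCPU 0.]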