Re: [Xen-devel] [PATCH 2/3] credit2: libxl related changes to add support for runqueue per cpupool.
[Trimming the Cc-list a bit] On 9/14/17 7:37 AM, Juergen Gross wrote: On 12/09/17 02:45, anshulmakkar wrote: Introduces scheduler-specific parameters at the libxl level which are passed on to libxc, e.g. runqueue for credit2. Signed-off-by: Anshul Makkar <anshulmak...@gmail.com> int libxl_cpupool_destroy(libxl_ctx *ctx, uint32_t poolid); int libxl_cpupool_rename(libxl_ctx *ctx, const char *name, uint32_t poolid); int libxl_cpupool_cpuadd(libxl_ctx *ctx, uint32_t poolid, int cpu); diff --git a/tools/libxl/libxl_cpupool.c b/tools/libxl/libxl_cpupool.c index 85b0688..e3ce7b3 100644 --- a/tools/libxl/libxl_cpupool.c +++ b/tools/libxl/libxl_cpupool.c @@ -130,7 +130,7 @@ int libxl_get_freecpus(libxl_ctx *ctx, libxl_bitmap *cpumap) int libxl_cpupool_create(libxl_ctx *ctx, const char *name, libxl_scheduler sched, libxl_bitmap cpumap, libxl_uuid *uuid, - uint32_t *poolid) + uint32_t *poolid, const libxl_scheduler_params *sched_params) { GC_INIT(ctx); int rc; @@ -138,6 +138,7 @@ int libxl_cpupool_create(libxl_ctx *ctx, const char *name, xs_transaction_t t; char *uuid_string; uint32_t xcpoolid; +xc_schedparam_t xc_sched_param; /* Accept '0' as 'any poolid' for backwards compatibility */ if ( *poolid == LIBXL_CPUPOOL_POOLID_ANY @@ -151,8 +152,18 @@ int libxl_cpupool_create(libxl_ctx *ctx, const char *name, GC_FREE; return ERROR_NOMEM; } +if (sched_params) +{ +xc_sched_param.u.sched_credit2.ratelimit_us = + sched_params->u.credit2.ratelimit_us; +xc_sched_param.u.sched_credit2.runq = sched_params->u.credit2.runqueue; +xc_sched_param.u.sched_credit.tslice_ms = sched_params->u.credit.tslice_ms; +xc_sched_param.u.sched_credit.ratelimit_us = sched_params->u.credit.ratelimit_us;

Don't you need some input parameter validation here?

Agree. Will perform validation.

+} +else +xc_sched_param.u.sched_credit2.runq = LIBXL_CREDIT2_RUNQUEUE_DEFAULT;

So you are passing the LIBXL defines down to the hypervisor expecting they match. I think this is a major layering violation.
I need to pass the DEFAULT runq arrangement if the user has not selected any option, and I want to do it near the top level (libxc) so that consistency can be maintained at the lower scheduler layer. Please can you suggest an alternative that will maintain layering consistency? Juergen anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
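One way to avoid passing LIBXL_* runqueue constants straight down to the hypervisor is to translate them explicitly at the libxl/libxc boundary, so the two ABIs may evolve independently. A minimal sketch of such a mapping — all enum names and numeric values here are illustrative assumptions, not the real Xen/libxl interface:

```c
#include <assert.h>

/* Illustrative libxl-level values (not the real libxl ABI). */
enum libxl_credit2_runqueue {
    LIBXL_CREDIT2_RUNQUEUE_DEFAULT,
    LIBXL_CREDIT2_RUNQUEUE_CPU,
    LIBXL_CREDIT2_RUNQUEUE_CORE,
    LIBXL_CREDIT2_RUNQUEUE_SOCKET,
};

/* Illustrative hypervisor-level values; deliberately different numbers,
 * to show the two namespaces need not match. */
enum xc_credit2_runqueue {
    XC_CSCHED2_RUNQ_CPU = 10,
    XC_CSCHED2_RUNQ_CORE = 11,
    XC_CSCHED2_RUNQ_SOCKET = 12,
};

/* Explicit translation at the boundary: libxl numbers never leak into
 * the hypercall, and DEFAULT resolves to whatever the hypervisor side
 * considers its default (SOCKET here, purely for illustration). */
static enum xc_credit2_runqueue runqueue_to_xc(enum libxl_credit2_runqueue rq)
{
    switch (rq) {
    case LIBXL_CREDIT2_RUNQUEUE_CPU:    return XC_CSCHED2_RUNQ_CPU;
    case LIBXL_CREDIT2_RUNQUEUE_CORE:   return XC_CSCHED2_RUNQ_CORE;
    case LIBXL_CREDIT2_RUNQUEUE_SOCKET: return XC_CSCHED2_RUNQ_SOCKET;
    default:                            return XC_CSCHED2_RUNQ_SOCKET;
    }
}
```

With a switch like this, adding or renumbering a libxl value can never silently change what the hypervisor sees.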
Re: [Xen-devel] [PATCH 2/4] xen: sched: optimize exclusive pinning case (Credit1 & 2)
On 9/15/17 6:35 PM, Dario Faggioli wrote: static unsigned int __read_mostly opt_migrate_resist = 500; integer_param("sched_credit2_migrate_resist", opt_migrate_resist); @@ -1453,6 +1459,26 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) (unsigned char *)); } +/* + * Exclusive pinning is when a vcpu has hard-affinity with only one + * cpu, and there is no other vcpu that has hard-affinity with that + * same cpu. This is infrequent, but if it happens, is for achieving + * the most possible determinism, and least possible overhead for + * the vcpus in question. + * + * Try to identify the vast majority of these situations, and deal + * with them quickly.

Sorry if I missed reviewing the earlier series on the same subject, but I am not completely satisfied with the condition that pinning to a pcpu is possible only if no other vcpu has hard affinity with that pcpu. I think we can do away with this condition and give pinning a priority. Your thoughts please.

+ */ +if ( unlikely((new->flags & CSFLAG_pinned) && + cpumask_test_cpu(cpu, &rqd->idle) && + !cpumask_test_cpu(cpu, &rqd->tickled)) ) +{ +ASSERT(cpumask_cycle(cpu, new->vcpu->cpu_hard_affinity) == cpu); +SCHED_STAT_CRANK(tickled_idle_cpu_excl); +ipid = cpu; +goto tickle; +} + for_each_affinity_balance_step( bs ) { /* Just skip first step, if we don't have a soft affinity */ @@ -2826,6 +2852,19 @@ csched2_dom_cntl( return rc; } +static void +csched2_aff_cntl(const struct scheduler *ops, struct vcpu *v, + const cpumask_t *hard, const cpumask_t *soft) +{ +struct csched2_vcpu *svc = csched2_vcpu(v); + +/* Are we becoming exclusively pinned? */ +if ( cpumask_weight(hard) == 1 ) +__set_bit(__CSFLAG_pinned, &svc->flags); +else +__clear_bit(__CSFLAG_pinned, &svc->flags); +} + static int csched2_sys_cntl(const struct scheduler *ops, struct xen_sysctl_scheduler_op *sc)

Looks fine. Anshul
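The fast path quoted above can be modelled with a toy bitmask sketch (the function name and the single-word cpumask representation are assumptions for illustration): an exclusively pinned, waking vcpu tickles its one allowed cpu directly when that cpu is idle and not already tickled, skipping the affinity balancing loop entirely.

```c
#include <assert.h>

/* Toy model with cpumasks as single machine words. Returns the cpu to
 * tickle on the fast path, or -1 to fall through to the normal
 * affinity balancing loop. */
static int excl_pin_fast_path(int pinned, int cpu,
                              unsigned long idle, unsigned long tickled)
{
    if (pinned &&
        (idle & (1UL << cpu)) &&   /* the one allowed cpu is idle...  */
        !(tickled & (1UL << cpu))) /* ...and nobody tickled it yet    */
        return cpu;
    return -1;
}
```

If the cpu is busy or already tickled, the slow path still runs, so correctness does not depend on the fast path firing.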
Re: [Xen-devel] [PATCH 0/4] xen: sched: optimize exclusive pinning and soft-affinity checking
On 9/15/17 6:35 PM, Dario Faggioli wrote: Hello, Dario Faggioli (4): xen: sched: introduce 'adjust_affinity' hook. xen: sched: optimize exclusive pinning case (Credit1 & 2) xen: sched: improve checking soft-affinity xen: sched: simplify (and speedup) checking soft-affinity xen/arch/x86/dom0_build.c | 4 + xen/common/sched_credit.c | 114 +++--- xen/common/sched_credit2.c | 50 -- xen/common/sched_null.c | 8 +-- xen/common/schedule.c | 67 +++-- xen/include/xen/perfc_defn.h | 1 xen/include/xen/sched-if.h | 16 +++--- xen/include/xen/sched.h | 5 ++ 8 files changed, 188 insertions(+), 77 deletions(-) --

Please can you share the link to the branch on which these changes have been done? Anshul
Re: [Xen-devel] [PATCH v2 1/4] xen: credit2: implement utilization cap
On 8/18/17 4:50 PM, Dario Faggioli wrote: @@ -474,6 +586,12 @@ static inline struct csched2_runqueue_data *c2rqd(const struct scheduler *ops, return &csched2_priv(ops)->rqd[c2r(cpu)]; } +/* Does the domain of this vCPU have a cap? */ +static inline bool has_cap(const struct csched2_vcpu *svc) +{ +return svc->budget != STIME_MAX; +} + /* * Hyperthreading (SMT) support. * @@ -1515,7 +1633,16 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now, * that the credit it has spent so far get accounted. */ if ( svc->vcpu == curr_on_cpu(svc_cpu) ) +{ burn_credits(rqd, svc, now); +/* + * And, similarly, in case it has run out of budget, as a + * consequence of this round of accounting, we also must inform + * its pCPU that it's time to park it, and pick up someone else. + */ +if ( unlikely(svc->budget <= 0) ) +tickle_cpu(svc_cpu, rqd);

This is for the accounting of credit; why will it impact the budget? Or do you mean that the budget of the current vcpu may have expired while doing the credit calculation?

+} start_credit = svc->credit; @@ -1571,27 +1698,35 @@ void burn_credits(struct csched2_runqueue_data *rqd, delta = now - svc->start_time; -if ( likely(delta > 0) ) -{ -SCHED_STAT_CRANK(burn_credits_t2c); -t2c_update(rqd, delta, svc); -svc->start_time = now; -} -else if ( delta < 0 ) +if ( unlikely(delta <= 0) ) { +static void replenish_domain_budget(void* data) +{ +struct csched2_dom *sdom = data; +unsigned long flags; +s_time_t now; +LIST_HEAD(parked); + +spin_lock_irqsave(&sdom->budget_lock, flags); + +now = NOW(); + +/* + * Let's do the replenishment. Note, though, that a domain may overrun, + * which means the budget would have gone below 0 (reasons may be system + * overbooking, accounting issues, etc.). It also may happen that we are + * handling the replenishment (much) later than we should (reasons may + * again be overbooking, or issues with timers).
+ * + * Even in cases of overrun or delay, however, we expect that in 99% of + * cases, doing just one replenishment will be good enough for being able + * to unpark the vCPUs that are waiting for some budget. + */ +do_replenish(sdom); + +/* + * And now, the special cases: + * 1) if we are late enough to have skipped (at least) one full period, + * what we must do is doing more replenishments. Note that, however, + * every time we add tot_budget to the budget, we also move next_repl + * away by CSCHED2_BDGT_REPL_PERIOD, to make sure the cap is always + * respected. + */ +if ( unlikely(sdom->next_repl <= now) ) +{ +do +do_replenish(sdom); +while ( sdom->next_repl <= now ); +}

Just a bit confused: have you seen this kind of scenario? Please can you explain it? Is this condition necessary?

+/* + * 2) if we overrun by more than tot_budget, then budget+tot_budget is + * still < 0, which means that we can't unpark the vCPUs. Let's bail, + * and wait for future replenishments. + */ +if ( unlikely(sdom->budget <= 0) ) +{ +spin_unlock_irqrestore(&sdom->budget_lock, flags); +goto out; +}

Saying "if we overran by more than tot_budget in the previous run" would make this clearer.

+ +/* Since we do more replenishments, make sure we didn't overshoot. */ +sdom->budget = min(sdom->budget, sdom->tot_budget); + +/* + * As above, let's prepare the temporary list, out of the domain's + * parked_vcpus list, now that we hold the budget_lock. Then, drop such + * lock, and pass the list to the unparking function. + */ +list_splice_init(&sdom->parked_vcpus, &parked); + +spin_unlock_irqrestore(&sdom->budget_lock, flags); + +unpark_parked_vcpus(sdom->dom->cpupool->sched, &parked); + + out: +set_timer(sdom->repl_timer, sdom->next_repl); +} + #ifndef NDEBUG static inline void csched2_vcpu_check(struct vcpu *vc) @@ -1658,6 +2035,9 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd) } svc->tickled_cpu = -1; + Rest looks good to me.
Thanks, Anshul
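Regarding the catch-up loop questioned above: the scenario is a replenishment timer that fires (much) later than scheduled, e.g. after more than one full period. A simplified, self-contained model (names and units are assumptions, not the real Credit2 code) shows why a single top-up is not enough in that case, and why the budget must still be clamped to tot_budget so the cap holds:

```c
#include <assert.h>
#include <stdint.h>

#define PERIOD 10 /* illustrative replenishment period */

struct dom {
    int64_t budget, tot_budget, next_repl;
};

static void do_replenish(struct dom *d)
{
    d->next_repl += PERIOD; /* keep period boundaries aligned */
    d->budget += d->tot_budget;
}

/* One timer firing: normally a single top-up suffices, but if the timer
 * fired so late that whole periods were skipped, keep replenishing until
 * next_repl lands in the future again, then clamp so the cap holds. */
static void replenish(struct dom *d, int64_t now)
{
    do_replenish(d);
    while (d->next_repl <= now)
        do_replenish(d);
    if (d->budget > d->tot_budget)
        d->budget = d->tot_budget;
}

/* Helpers exposing the outcome of one (possibly very late) firing. */
static int64_t next_repl_after(int64_t start, int64_t tot, int64_t now)
{
    struct dom d = { 0, tot, start };
    replenish(&d, now);
    return d.next_repl;
}

static int64_t budget_after(int64_t start, int64_t tot, int64_t now)
{
    struct dom d = { 0, tot, start };
    replenish(&d, now);
    return d.budget;
}
```

For example, a timer due at t=10 that only fires at t=25 must replenish three times to push next_repl to t=30, but the domain still ends up with at most one period's worth of budget.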
Re: [Xen-devel] [PATCH 2/7] xen/tools: credit2: soft-affinity awareness in runq_tickle()
On 16/06/2017 15:13, Dario Faggioli wrote: Soft-affinity support is usually implemented by means of a two step "balancing loop", where: - during the first step, we consider soft-affinity (if the vcpu has one); - during the second (if we get to it), we consider hard-affinity. In runq_tickle(), we need to do that for checking whether we can execute the waking vCPU on a pCPU that is idle. In fact, we want to be sure that, if there is an idle pCPU in the vCPU's soft affinity, we'll use it. If there are no such idle pCPUs, though, and we have to check non-idle ones, we can avoid the loop and do both hard and soft-affinity in one pass. In fact, we can scan the runqueue and compute a "score" for each vCPU which is running on each pCPU. The idea is, since we may have to preempt someone: - try to make sure that the waking vCPU will run inside its soft-affinity, - try to preempt someone that is running outside of its own soft-affinity. The value of the score is added to a trace record, so xenalyze's code and tools/xentrace/formats are updated accordingly. Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> --- Cc: George Dunlap <george.dun...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> Cc: Ian Jackson <ian.jack...@eu.citrix.com> Cc: Wei Liu <wei.l...@citrix.com> ---

Reviewed. Looks good. Anshul
Re: [Xen-devel] [PATCH 1/7] xen: sched: factor affinity helpers out of sched_credit.c
On 16/06/2017 15:13, Dario Faggioli wrote: In fact, we want to be able to use them from any scheduler. While there, make the moved code use 'v' for the struct vcpu * variable, like it should be done everywhere. No functional change intended. Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> Signed-off-by: Justin T. Weaver <jtwea...@hawaii.edu> Reviewed-by: George Dunlap <george.dun...@citrix.com> --- Cc: Anshul Makkar <anshul.mak...@citrix.com> --- xen/common/sched_credit.c | 97 +++- xen/include/xen/sched-if.h | 64 + 2 files changed, 79 insertions(+), 82 deletions(-) diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index efdf6bf..53773df 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -136,27 +136,6 @@ * + * Hard affinity balancing is always necessary and must never be skipped. + * But soft affinity need only be considered when it has a functionally + * different effect than other constraints (such as hard affinity, cpus + * online, or cpupools). + * + * Soft affinity only needs to be considered if: + * * The cpus in the cpupool are not a subset of soft affinity

Do you mean the cpus in a cpupool are not a subset of the soft affinity of a vcpu which is in a runnable state?

+ * * The hard affinity is not a subset of soft affinity + * * There is an overlap between the soft affinity and the mask which is + * currently being considered. + */

Reviewed; looks good. Anshul
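The three bullets from the quoted comment can be condensed into one predicate. A bitmask sketch (single-word cpumasks and the function name are assumptions for illustration) of when a separate soft-affinity balancing step is actually worth taking:

```c
#include <assert.h>

/* Soft affinity only matters when it genuinely constrains the choice:
 * not when the cpupool's cpus or the hard affinity already fall
 * entirely inside it, and only when it overlaps the mask at hand. */
static int soft_affinity_matters(unsigned long hard, unsigned long soft,
                                 unsigned long pool, unsigned long mask)
{
    if ((pool & ~soft) == 0)   /* pool is a subset of soft affinity */
        return 0;
    if ((hard & ~soft) == 0)   /* hard is a subset of soft affinity */
        return 0;
    return (soft & mask) != 0; /* must overlap the current mask */
}
```

When the predicate is false, the soft-affinity step would consider exactly the same cpus as the hard-affinity step (or none at all), so skipping it loses nothing.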
Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap
On 13/06/2017 22:13, Dario Faggioli wrote: On Tue, 2017-06-13 at 17:07 +0100, Anshul Makkar wrote: On 12/06/2017 14:19, Dario Faggioli wrote: @@ -92,6 +92,82 @@ */ what I want to ask is that if the budget of the domain is replenished, but credit for the vcpus of that domain is not available, then what happens. Yes, but the point is that budget can be available or not, while credits are always available. There's no such thing as credit not being available at all. The amount of credits each vcpu has decides which vcpu will run, in the sense that it will be the one that has the highest amount of credits. The others will indeed wait, but because they've got less credit than the one that runs, not because they don't have credits available.

Ok, as discussed, credits are reset when they reach 0. In that sense they are considered "always available". I believe vcpus won't be scheduled (even if they have budget_quota) till they get their credit replenished.

Credits are not exhausted or replenished. But... it's already totally dynamic. csched2_dom_cntl() { svc->budget_quota = max(sdom->tot_budget / sdom->nr_vcpus, CSCHED2_MIN_TIMER); }

If domain->tot_budget = 200 and nr_vcpus is 4, then each vcpu gets 50%. How is this dynamic allocation? We are not considering the utilization of the domain's other vcpus before allocating the budget_quota for some vcpu.

Right. Well, what this means is that each vcpu will get budget in chunks of tot_budget/nr_vcpus. But then, how much budget each vcpu will actually be able to get and consume in each period, it's impossible to know in advance, as it will depend on overall system load, and the behavior of the various vcpus of the domain.
Yeah, the current implementation is dynamic in the sense that a vcpu can execute, in total, more than its quota across multiple budget cycles, but it's static in the sense that a vcpu can't execute more than its budget quota in a single budget cycle. In runq candidate we have a code base Regards, Dario Reviewed-by: Anshul Makkar <anshul.mak...@citrix.com> Anshul
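The per-vcpu split being debated is the simple division quoted from csched2_dom_cntl(), with a floor. A sketch of the arithmetic (the CSCHED2_MIN_TIMER value here is an assumption for illustration, not the real constant):

```c
#include <assert.h>
#include <stdint.h>

#define CSCHED2_MIN_TIMER 2 /* illustrative floor, not the real value */

/* Each vcpu's per-period chunk: an even share of the domain's total
 * budget, but never below the scheduler's minimum timer resolution. */
static int64_t budget_quota(int64_t tot_budget, int nr_vcpus)
{
    int64_t q = tot_budget / nr_vcpus;
    return q > CSCHED2_MIN_TIMER ? q : CSCHED2_MIN_TIMER;
}
```

So a 200% domain budget over 4 vcpus yields 50% chunks, which is exactly the worked example above; the dynamism comes from vcpus grabbing chunks repeatedly and returning unused budget, not from the chunk size itself.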
Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap
On 12/06/2017 14:19, Dario Faggioli wrote: Hey, Budget is burned by the domain's vCPUs, in a similar way to how credits are. When a domain runs out of budget, its vCPUs can't run any longer. if the vcpus of a domain have credit and if budget has run out, will the vcpus not be scheduled? Is this a question? Assuming it is, what do you mean with "domain have credit"? Domains always have credits, and they never run out of them. There's no such thing as a domain not being runnable because it does not have credits.

"domain have credit"? The statement is: if the "vcpus of the domain have credit".

About budget, a domain with <= 0 budget means all its vcpus are not runnable, and hence won't be scheduled, no matter their credits.

You answered the question here.

@@ -92,6 +92,82 @@ */ /* + * Utilization cap: + * + * Setting a pCPU utilization cap for a domain means the following: + * + * - a domain can have a cap, expressed in terms of % of physical + * For implementing this, we use the following approach: + * + * - each domain is given a 'budget', and each domain has a timer, which + * replenishes the domain's budget periodically. The budget is the amount + * of time the vCPUs of the domain can use every 'period'; + * + * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for all domains + * (but each domain has its own timer; so they are all periodic with the same + * period, but replenishments of the budgets of the various domains, at + * period boundaries, are not synchronous); + * + * - when vCPUs run, they consume budget. When they don't run, they don't + * consume budget. If there is no budget left for the domain, no vCPU of + * that domain can run. If a vCPU tries to run and finds that there is no + * budget, it blocks. + * Budget never expires, so at whatever time a vCPU wants to run, it can + * check the domain's budget, and if there is some, it can use it. + * + * - budget is replenished to the top of the capacity for the domain once + * per period.
Even if there was some leftover budget from previous period, + * though, the budget after a replenishment will always be at most equal + * to the total capacity of the domain ('tot_budget'); +

Budget is replenished but credits are not available? Still not getting this. What I want to ask is: if the budget of the domain is replenished, but credit for the vcpus of that domain is not available, then what happens? I believe vcpus won't be scheduled (even if they have budget_quota) till they get their credit replenished.

Budget is finished but the vcpu has not reached the rate limit boundary?

Budget takes precedence over ratelimiting. This is important to keep cap working "regularly", rather than in some kind of permanent "trying-to-keep-up-with-overruns-in-previous-periods" state. And, ideally, a vcpu cap and ratelimiting should be set in such a way that they don't step on each other's toes (or do that only rarely). I can see about trying to print a warning when I detect potentially tricky values (but it's not easy, considering budget is per-domain, so I can't be sure about how much each vcpu will actually get, and whether or not

Why can't you be sure? The scheduler knows the domain budget and the number of vcpus per domain, so we can calculate the budget_quota and translate it into a cpu slot duration. Similarly, the value of the rate limit is also known. We can compare and give a warning to the user if the budget_quota is less than the rate limit. This is very important for the user to know: if wrongly chosen, it can adversely affect the system's performance with frequent context switches (the problem we are aware of).

that will reveal to be significantly less than ratelimiting most of the times). + * - when a budget replenishment occurs, if there are vCPUs that had been + * blocked because of lack of budget, they'll be unblocked, and they will + * (potentially) be able to run again.
+ * + * Finally, some even more implementation related detail: + * + * - budget is stored in a domain-wide pool. vCPUs of the domain that want + * to run go to such pool, and grab some. When they do so, the amount + * they grabbed is _immediately_ removed from the pool. This happens in + * vcpu_try_to_get_budget(); + * + * - when vCPUs stop running, if they've not consumed all the budget they + * took, the leftover is put back in the pool. This happens in + * vcpu_give_budget_back();

200% budget, 4 vcpus to run on 4 pcpus, each allowed only 50% of the budget. This is a static allocation.

Err... again, are you telling or asking?

Giving an example to prove it's a static allocation: e.g. with 2 vcpus running on 2 pcpus at 20% budgeted time, if vcpu3 wants to execute some cpu intensive task, then it won't be allowed to use more than 50% of the pcpus.

With what parameters? You mean with these ones you cite here (i.e., a 200% budget)? If the VM has 200%, and vcpu1 and vcpu2 consume 20%+20%=40%,
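The grab/give-back behaviour described in the quoted comment is what makes the pool dynamic rather than a static 50% per vcpu. A toy model of the domain-wide pool (the names here are assumptions; the real functions are vcpu_try_to_get_budget() and vcpu_give_budget_back()):

```c
#include <assert.h>
#include <stdint.h>

struct pool { int64_t budget; };

/* A vcpu grabs up to 'want' from the domain-wide pool; whatever it
 * gets is removed from the pool immediately. */
static int64_t try_get_budget(struct pool *p, int64_t want)
{
    int64_t got = p->budget < want ? p->budget : want;
    p->budget -= got;
    return got;
}

/* Unused budget goes back, so another vcpu can grab it later in the
 * same period: nothing caps an individual vcpu at tot_budget/nr_vcpus. */
static void give_budget_back(struct pool *p, int64_t unused)
{
    p->budget += unused;
}

/* One vcpu grabs a 50 chunk out of 200 but only burns 20 of it. */
static int64_t demo_remaining(void)
{
    struct pool p = { .budget = 200 };
    int64_t got = try_get_budget(&p, 50);
    give_budget_back(&p, got - 20);
    return p.budget;
}
```

In the demo, 180 remains in the pool after the vcpu returns its leftover, so a cpu-hungry sibling vcpu can later grab far more than its "even share" within the same period.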
Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap
On 08/06/2017 13:08, Dario Faggioli wrote: This commit implements the Xen part of the cap mechanism for Credit2. A cap is how much, in terms of % of physical CPU time, a domain can execute at most. For instance, a domain that must not use more than 1/4 of one physical CPU, must have a cap of 25%; one that must not use more than 1+1/2 of physical CPU time, must be given a cap of 150%. Caps are per domain, so it is all a domain's vCPUs, cumulatively, that will be forced to execute no more than the decided amount. This is implemented by giving each domain a 'budget', and using a (per-domain again) periodic timer. Values of budget and 'period' are chosen so that budget/period is equal to the cap itself. Budget is burned by the domain's vCPUs, in a similar way to how credits are. When a domain runs out of budget, its vCPUs can't run any longer.

If the vcpus of a domain have credit and the budget has run out, will the vcpus not be scheduled?

They can run again when the budget is replenished by the timer, which happens once every period. Blocking the vCPUs because of lack of budget happens by means of a new (_VPF_parked) pause flag, so that, e.g., vcpu_runnable() still works. This is similar to what is done in sched_rtds.c, as opposed to what happens in sched_credit.c, where vcpu_pause() and vcpu_unpause() are used (which means, among other things, more overhead). Note that xenalyze and tools/xentrace/formats are also modified, to keep them updated with one modified event.
Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> --- Cc: George Dunlap <george.dun...@eu.citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> Cc: Andrew Cooper <andrew.coop...@citrix.com> Cc: Jan Beulich <jbeul...@suse.com> Cc: Ian Jackson <ian.jack...@eu.citrix.com> Cc: Wei Liu <wei.l...@citrix.com> --- tools/xentrace/formats |2 tools/xentrace/xenalyze.c | 10 + xen/common/sched_credit2.c | 470 +--- xen/include/xen/sched.h|3 4 files changed, 445 insertions(+), 40 deletions(-) diff --git a/tools/xentrace/formats b/tools/xentrace/formats index 8b31780..142b0cf 100644 --- a/tools/xentrace/formats +++ b/tools/xentrace/formats @@ -51,7 +51,7 @@ 0x00022201 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tick 0x00022202 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_pos [ dom:vcpu = 0x%(1)08x, pos = %(2)d] -0x00022203 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit burn[ dom:vcpu = 0x%(1)08x, credit = %(2)d, delta = %(3)d ] +0x00022203 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit burn[ dom:vcpu = 0x%(1)08x, credit = %(2)d, budget = %(3)d, delta = %(4)d ] 0x00022204 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit_add 0x00022205 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle_check [ dom:vcpu = 0x%(1)08x, credit = %(2)d ] 0x00022206 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle [ cpu = %(1)d ] diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c index fa608ad..c16c02d 100644 --- a/tools/xentrace/xenalyze.c +++ b/tools/xentrace/xenalyze.c @@ -7680,12 +7680,14 @@ void sched_process(struct pcpu_info *p) if(opt.dump_all) { struct { unsigned int vcpuid:16, domid:16; -int credit, delta; +int credit, budget, delta; } *r = (typeof(r))ri->d; -printf(" %s csched2:burn_credits d%uv%u, credit = %d, delta = %d\n", - ri->dump_header, r->domid, r->vcpuid, - r->credit, r->delta); +printf(" %s csched2:burn_credits d%uv%u, credit = %d, ", + ri->dump_header, r->domid, r->vcpuid, r->credit); +if ( r->budget != INT_MIN ) +printf("budget = %d, ", r->budget); 
+printf("delta = %d\n", r->delta); } break; case TRC_SCHED_CLASS_EVT(CSCHED2, 5): /* TICKLE_CHECK */ diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 126417c..ba4bf4b 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -92,6 +92,82 @@ */ /* + * Utilization cap: + * + * Setting a pCPU utilization cap for a domain means the following: + * + * - a domain can have a cap, expressed in terms of % of physical CPU time. + * A domain that must not use more than 1/4 of _one_ physical CPU, will + * be given a cap of 25%; a domain that must not use more than 1+1/2 of + * physical CPU time, will be given a cap of 150%; + * + * - caps are per-domain (not per-vCPU). If a domain has only 1 vCPU, and + * a 40% cap, that one vCPU will use 40% of one pCPU. If a domain has 4 + * vCPUs, and a 200% cap, all its 4 vCPUs are allowed to run for (the + * equivalent of) 100% time on 2 pCPUs. How much eac
Re: [Xen-devel] DESIGN: CPUID part 3
On 08/06/2017 14:12, Andrew Cooper wrote: Presented herewith is a plan for the final part of CPUID work, which primarily covers better Xen/Toolstack interaction for configuring the guests CPUID policy. A PDF version of this document is available from: http://xenbits.xen.org/people/andrewcoop/cpuid-part-3.pdf There are a number of still-open questions, which I would appreciate views on. ~Andrew # Proposal First and foremost, split the current **max\_policy** notion into separate **max** and **default** policies. This allows for the provision of features which are unused by default, but may be opted in to, both at the hypervisor level and the toolstack level. At the hypervisor level, **max** constitutes all the features Xen can use on the current hardware, while **default** is the subset thereof which are supported features, the features which the user has explicitly opted in to, and excluding any features the user has explicitly opted out of. A new `cpuid=` command line option shall be introduced, whose internals are generated automatically from the featureset ABI. This means that all features added to `include/public/arch-x86/cpufeatureset.h` automatically gain command line control. (RFC: The same top level option can probably be used for non-feature CPUID data control, although I can't currently think of any cases where this would be used. Also find a sensible way to express 'available but not to be used by Xen', as per the current `smep` and `smap` options.) At the guest level, **max** constitutes all the features which can be offered to each type of guest on this hardware. Derived from Xen's **default** policy, it includes the supported features and explicitly opted in to features, which are appropriate for the guest. The guests **default** policy is then derived from its **max**, and includes the supported features which are considered migration safe.
(RFC: This distinction is rather fuzzy, but for example it wouldn't include things like ITSC by default, as that is likely to go wrong unless special care is taken.)

Just from another perspective: what happens to the features which have been explicitly selected but are not migration safe? Do we consider them in the guest's default policy?

All global policies (Xen and guest, max and default) shall be made available to the toolstack, in a manner similar to the existing

Instead of all, do you see any harm if we expose only the default policies of Xen and the guest to the toolstack?

_XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism. This allows decisions to be taken which include all CPUID data, not just the feature bitmaps. New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced, which allow the toolstack to query and set the cpuid policy for a specific domain. It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen is unhappy with any aspect of the policy during auditing. When a domain is initially created, the appropriate guest **default** policy is duplicated for use. When auditing, Xen shall audit the toolstack's requested policy against the guest's **max** policy. This allows experimental features or non-migration-safe features to be opted in to, without those features being imposed upon all guests automatically. The `disable_migrate` field shall be dropped. The concept of migrateability is not boolean; it is a large spectrum, all of which needs to be managed by the toolstack.

Can't this large spectrum result in a bool which can then be used for disable_migrate? Sorry, I can't see any added value in removing disable_migrate.

The simple case is picking the common subset of features between the source and destination. This becomes more complicated e.g. if the guest uses LBR/LER, at which point the toolstack needs to consider hardware with the same LBR/LER format in addition to just the plain features.
`disable_migrate` is currently only used to expose ITSC to guests, but there are cases where it is perfectly safe to migrate such a guest, if the destination host has the same TSC frequency or hardware TSC scaling support. Finally, `disable_migrate` isn't (and cannot reasonably be) used to inhibit state gathering operations, as this interferes with debugging and monitoring tasks.

Thanks, Anshul
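The "simple case" mentioned above — picking the common subset of features between source and destination — is a bitwise AND over the featureset words. A minimal sketch (word-array featuresets, in the spirit of Xen's existing featureset representation; the function names are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Intersect two featuresets word by word: in the simple,
 * feature-bitmap-only sense, a guest can migrate if its policy fits
 * entirely inside this intersection. */
static void featureset_intersect(const uint32_t *src, const uint32_t *dst,
                                 uint32_t *out, size_t nr_words)
{
    size_t i;

    for (i = 0; i < nr_words; i++)
        out[i] = src[i] & dst[i];
}

/* Small demo: intersect two 2-word featuresets, return the second word. */
static uint32_t demo_intersect_word1(void)
{
    uint32_t a[2] = { 0xff00ff00u, 0x0000000fu };
    uint32_t b[2] = { 0x0ff00ff0u, 0x0000000cu };
    uint32_t out[2];

    featureset_intersect(a, b, out, 2);
    return out[1];
}
```

As the thread notes, this bitmap intersection is only the easy part: things like LBR/LER format compatibility are not expressible as feature bits, which is the argument for a toolstack-managed spectrum rather than a single boolean.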
[Xen-devel] [[PATCH -v2]] XenBus: Don't wait for the producer to fill the ring if
The condition "if there is space in the ring then wait for the producer to fill the ring" also evaluates to true if the ring is full. It leads to a deadlock where the producer is waiting for the consumer to consume the items, and the consumer is waiting for the producer to fill the ring. Fix for the issue: check if the ring is full and then break from the loop to consume the items from the ring. eg. case: prod = 1272, cons = 248. Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- v2: * resolved the coding style issue. * modified the "if" condition statement to make it simpler. tools/firmware/hvmloader/xenbus.c | 12 1 file changed, 12 insertions(+) diff --git a/tools/firmware/hvmloader/xenbus.c b/tools/firmware/hvmloader/xenbus.c index 448157d..2b89a56 100644 --- a/tools/firmware/hvmloader/xenbus.c +++ b/tools/firmware/hvmloader/xenbus.c @@ -141,7 +141,19 @@ static void ring_read(char *data, uint32_t len) /* Don't overrun the producer pointer */ while ( (part = MASK_XENSTORE_IDX(rings->rsp_prod - rings->rsp_cons)) == 0 ) +{ +/* + * Don't wait for producer to fill the ring if it is already full. + * Condition happens when you write string > 1K into the ring. + * eg case prod=1272 cons=248. + */ +if ( rings->rsp_prod - rings->rsp_cons == XENSTORE_RING_SIZE ) +{ +part = XENSTORE_RING_SIZE; +break; +} ring_wait(); +} /* Don't overrun the end of the ring */ if ( part > (XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons)) ) part = XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons); -- 2.7.4
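The deadlock can be reproduced arithmetically with the numbers from the commit message. With a 1024-byte ring, prod = 1272 and cons = 248 give prod - cons = 1024: the ring is completely full, yet the masked index is 0, which the original loop condition reads as "nothing to consume":

```c
#include <assert.h>
#include <stdint.h>

#define XENSTORE_RING_SIZE 1024 /* power of two, as in the hvmloader code */
#define MASK_XENSTORE_IDX(idx) ((idx) & (XENSTORE_RING_SIZE - 1))

/* The unmasked difference distinguishes full (== ring size) from
 * empty (== 0); the masked difference cannot tell them apart. */
static int ring_is_full(uint32_t prod, uint32_t cons)
{
    return prod - cons == XENSTORE_RING_SIZE;
}
```

So MASK_XENSTORE_IDX(1272 - 248) yields 0, exactly the value an empty ring yields, which is why the loop needs the explicit full-ring check before calling ring_wait().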
Re: [Xen-devel] [PATCH] XenBus: Don't wait for producer to fill the ring if the ring
On 17/05/2017 16:56, Jan Beulich wrote: On 17.05.17 at 16:57, <anshul.mak...@citrix.com> wrote: The condition to check whether there is space in the ring buffer also becomes true if the buffer is full, thus the consumer waits for the producer to fill the buffer even though it is already full. To resolve the situation, check if the buffer is full and then break from the loop. e.g case: prod = 1272, cons = 248. Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> Please avoid indenting the entire commit message.

Oh, sorry about that. I saw this format in a few previous commits and adopted it. Will correct it.

--- a/tools/firmware/hvmloader/xenbus.c +++ b/tools/firmware/hvmloader/xenbus.c @@ -141,7 +141,18 @@ static void ring_read(char *data, uint32_t len) /* Don't overrun the producer pointer */ while ( (part = MASK_XENSTORE_IDX(rings->rsp_prod - rings->rsp_cons)) == 0 ) +{ +/* don't wait for producer to fill the ring if it is already full. + * Condition happens when you write string > 1K into the ring. + * eg case prod=1272 cons=248. + */ Comment style. +if ( !(XENSTORE_RING_SIZE - (rings->rsp_prod - rings->rsp_cons)) ) Is this any different from if ( rings->rsp_prod - rings->rsp_cons == XENSTORE_RING_SIZE ) ?

No, it's the same. Ok, will use the suggested approach.

Jan
[Xen-devel] [PATCH] XenBus: Don't wait for producer to fill the ring if the ring
The condition to check whether there is space in the ring buffer also becomes true if the buffer is full, thus the consumer waits for the producer to fill the buffer even though it is already full. To resolve the situation, check if the buffer is full and then break from the loop. e.g case: prod = 1272, cons = 248. Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- tools/firmware/hvmloader/xenbus.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/tools/firmware/hvmloader/xenbus.c b/tools/firmware/hvmloader/xenbus.c index 448157d..f8fd730 100644 --- a/tools/firmware/hvmloader/xenbus.c +++ b/tools/firmware/hvmloader/xenbus.c @@ -141,7 +141,18 @@ static void ring_read(char *data, uint32_t len) /* Don't overrun the producer pointer */ while ( (part = MASK_XENSTORE_IDX(rings->rsp_prod - rings->rsp_cons)) == 0 ) +{ +/* don't wait for producer to fill the ring if it is already full. + * Condition happens when you write string > 1K into the ring. + * eg case prod=1272 cons=248. + */ +if ( !(XENSTORE_RING_SIZE - (rings->rsp_prod - rings->rsp_cons)) ) +{ +part = XENSTORE_RING_SIZE; +break; +} ring_wait(); +} /* Don't overrun the end of the ring */ if ( part > (XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons)) ) part = XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons); -- 2.7.4
Re: [Xen-devel] [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal().
On 02/03/17 10:38, Dario Faggioli wrote: Checking whether or not a vCPU can be 'stolen' from a peer pCPU's runqueue is relatively cheap. Therefore, let's do that as early as possible, avoiding potentially useless complex checks, and cpumask manipulations. Signed-off-by: Dario Faggioli--- Cc: George Dunlap --- xen/common/sched_credit.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index 63a8675..2b13e99 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -708,12 +708,10 @@ static inline int __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask) { /* - * Don't pick up work that's in the peer's scheduling tail or hot on - * peer PCPU. Only pick up work that prefers and/or is allowed to run - * on our CPU. + * Don't pick up work that's or hot on peer PCPU, or that can't (or Not clear. + * would prefer not to) run on cpu. */ -return !vc->is_running && - !__csched_vcpu_is_cache_hot(vc) && +return !__csched_vcpu_is_cache_hot(vc) && cpumask_test_cpu(dest_cpu, mask); !vc->is_running doesn't ease the complexity and doesn't save much on CPU cycles. In fact, I think (!vc->is_running) makes the intention to check for migration much clearer to understand. Yeah, apart from the above reasons, it's safe to remove this check from here. } @@ -1622,7 +1620,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) BUG_ON( is_idle_vcpu(vc) ); /* - * If the vcpu has no useful soft affinity, skip this vcpu. + * If the vcpu is still in peer_cpu's scheduling tail, or if it + * has no useful soft affinity, skip it. + * * In fact, what we want is to check if we have any "soft-affine * work" to steal, before starting to look at "hard-affine work". * @@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) * vCPUs with useful soft affinities in some sort of bitmap * or counter. 
*/ -if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY - && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) ) +if ( vc->is_running || + (balance_step == CSCHED_BALANCE_SOFT_AFFINITY + && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) ) continue; csched_balance_cpumask(vc, balance_step, cpumask_scratch); ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit.
On 02/03/17 10:38, Dario Faggioli wrote: Since we're holding the lock on the pCPU from which we are trying to steal, it can't have disappeared, so we can drop the check for that (and convert it in an ASSERT()). And since we try to steal only from busy pCPUs, it's unlikely for such pCPU to be idle, so we mark it as such (and bail early if it unfortunately is). Signed-off-by: Dario Faggioli--- Cc: George Dunlap --- xen/common/sched_credit.c | 87 +++-- 1 file changed, 44 insertions(+), 43 deletions(-) diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index 4649e64..63a8675 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -1593,64 +1593,65 @@ static struct csched_vcpu * csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) { const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu); -const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu); struct csched_vcpu *speer; struct list_head *iter; struct vcpu *vc; +ASSERT(peer_pcpu != NULL); + /* * Don't steal from an idle CPU's runq because it's about to * pick up work from it itself. */ -if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) ) +if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) ) +goto out; We can just use if (!is_idle_vcpu(peer_vcpu)). Why replace it with code that introduces an unnecessary branch? + +list_for_each( iter, &peer_pcpu->runq ) { -list_for_each( iter, &peer_pcpu->runq ) -{ -speer = __runq_elem(iter); +speer = __runq_elem(iter); -/* - * If next available VCPU here is not of strictly higher - * priority than ours, this PCPU is useless to us. - */ -if ( speer->pri <= pri ) -break; +/* + * If next available VCPU here is not of strictly higher + * priority than ours, this PCPU is useless to us. + */ +if ( speer->pri <= pri ) +break; -/* Is this VCPU runnable on our PCPU? 
*/ +vc = speer->vcpu; +BUG_ON( is_idle_vcpu(vc) ); -/* - * If the vcpu has no useful soft affinity, skip this vcpu. - * In fact, what we want is to check if we have any "soft-affine - * work" to steal, before starting to look at "hard-affine work". - * - * Notice that, if not even one vCPU on this runq has a useful - * soft affinity, we could have avoid considering this runq for - * a soft balancing step in the first place. This, for instance, - * can be implemented by taking note of on what runq there are - * vCPUs with useful soft affinities in some sort of bitmap - * or counter. - */ -if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY - && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) ) -continue; +/* + * If the vcpu has no useful soft affinity, skip this vcpu. + * In fact, what we want is to check if we have any "soft-affine + * work" to steal, before starting to look at "hard-affine work". + * + * Notice that, if not even one vCPU on this runq has a useful + * soft affinity, we could have avoid considering this runq for + * a soft balancing step in the first place. This, for instance, + * can be implemented by taking note of on what runq there are + * vCPUs with useful soft affinities in some sort of bitmap + * or counter. + */ Now that we have come across a vcpu which doesn't have the desired soft affinity but is a potential migration candidate, wouldn't it be a better approach, instead of just forgetting it, to save such vcpus in some data structure, in order, so that we don't have to scan them again if we find nothing useful in the present run? +if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY + && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) ) +continue; -csched_balance_cpumask(vc, balance_step, cpumask_scratch); -if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) ) -{ -/* We got a candidate. Grab it! 
*/ -TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu, - vc->domain->domain_id, vc->vcpu_id); -SCHED_VCPU_STAT_CRANK(speer, migrate_q); -SCHED_STAT_CRANK(migrate_queued); -WARN_ON(vc->is_urgent); -__runq_remove(speer); -vc->processor =
Re: [Xen-devel] XenGT GPU virtualization
On 18/01/17 13:21, bharat gohil wrote: Hello, I am new to GPU and GPU virtualization and found that Xen supports Intel GPU virtualization using XenGT. I want to know: 1) What are the critical GPU commands passed from Xen to Dom0? 2) How does the Dom0 mediator or Xen validate the GPU commands passed from the domU GPU driver? 3) If one of the domU guests sends a bad (malicious) command to the GPU which leads the GPU into a bad state, can the Dom0 mediator or Xen prevent this kind of scenario? As far as I know, there is no mediation to check the commands. Xen does audit the target address space, but not GPU commands. -- Regards, Bharat Gohil ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.
Hi Jan, On 09/02/17 16:22, Jan Beulich wrote: On 08.02.17 at 16:32,wrote: On 07.02.17 at 18:26, wrote: Facing an issue where a bootstorm of guests leads to a host crash. I debugged and found that enabling PML introduces a race condition during the guest teardown stage, between disabling PML on a vcpu and a context switch happening for another vcpu. The crash happens only on Broadwell processors, as PML was introduced in that generation. Here is my analysis: Race condition: vmcs.c vmx_vcpu_disable_pml (vcpu){ vmx_vmcs_enter() ; vm_write( disable_PML); vmx_vmcx_exit();) In between I have a callpath from another pcpu executing context switch-> vmx_fpu_leave() and crash on vmwrite.. if ( !(v->arch.hvm_vmx.host_cr0 & X86_CR0_TS) ) { v->arch.hvm_vmx.host_cr0 |= X86_CR0_TS; __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); //crash } So that's after current has changed already, so it's effectively dealing with a foreign VMCS, but it doesn't use vmx_vmcs_enter(). The locking done in vmx_vmcs_try_enter() / vmx_vmcs_exit(), however, assumes that any user of a VMCS either owns the lock or has current as the owner of the VMCS. Of course such a call also can't be added here, as a vcpu on the context-switch-from path can't vcpu_pause() itself. That taken together with the bypassing of __context_switch() when the incoming vCPU is the idle one (which means that via context_saved() ->is_running will be cleared before running ->ctxt_switch_from()), the vcpu_pause() invocation in vmx_vmcs_try_enter() may not have to wait at all if the call happens between the clearing of ->is_running and the eventual invocation of vmx_ctxt_switch_from(). Mind giving the attached patch a try (which admittedly was only lightly tested so far - in particular I haven't seen the second of the debug messages being logged, yet)? The patch looks promising. I couldn't do much thorough testing, but the initial reboot cycles (around 20 reboots of 32 VMs) went fine. 
Jan Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.
Maybe "causes". But I am not sure, as disabling this feature solves the issue (no hypervisor crash), and we have reports that the same code base works fine on Haswell (again, there is no hardware support for PML on Haswell, but the code base is the same; that inherently means the PML code path is never executed, so no crash). Anshul On 08/02/17 15:37, Jan Beulich wrote: On 08.02.17 at 15:58,wrote: Code is trying to destroy multiple vcpus held by the domain: complete_domain_destroy->hvm_vcpu_destroy() for each vcpu. In vmx_vcpu_destroy, we have a call to vmx_vcpu_disable_pml which leads to a race while destroying foreign vcpus with other domains rebooting on the same vcpus. With a single domain reboot, no race is observed. Commit e18d4274772e52ac81fda1acb246d11ef666e5fe causes this race condition. Causes or exposes? I ask because the similar reports we have are all on 4.4.x, i.e. no PML code there at all. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.
Hi, Code is trying to destroy multiple vcpus held by the domain: complete_domain_destroy->hvm_vcpu_destroy() for each vcpu. In vmx_vcpu_destroy, we have a call to vmx_vcpu_disable_pml which leads to a race while destroying foreign vcpus with other domains rebooting on the same vcpus. With a single domain reboot, no race is observed. Commit e18d4274772e52ac81fda1acb246d11ef666e5fe causes this race condition. Anshul On 07/02/17 17:58, anshul makkar wrote: Hi, Sorry, forgot to include you in the cc list. Anshul On 07/02/17 17:26, anshul makkar wrote: Hi, Facing an issue where a bootstorm of guests leads to a host crash. I debugged and found that enabling PML introduces a race condition during the guest teardown stage, between disabling PML on a vcpu and a context switch happening for another vcpu. The crash happens only on Broadwell processors, as PML was introduced in that generation. Here is my analysis: Race condition: vmcs.c vmx_vcpu_disable_pml (vcpu){ vmx_vmcs_enter() ; vm_write( disable_PML); vmx_vmcx_exit();) In between I have a callpath from another pcpu executing context switch-> vmx_fpu_leave() and crash on vmwrite.. if ( !(v->arch.hvm_vmx.host_cr0 & X86_CR0_TS) ) { v->arch.hvm_vmx.host_cr0 |= X86_CR0_TS; __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); //crash } Debug logs (XEN) [221256.749928] VMWRITE VMCS Invalid ! 
(XEN) [221256.754870] **[00] { now c93b4341df1d, hw 0035fffea000, op 0035fffea000 } vmclear (XEN) [221256.765052] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221256.773969] **[01] { now c93b4341e099, hw , op 0035fffea000 } vmptrld (XEN) [221256.784150] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221256.792197] **[02] { now c93b4341e1f1, hw 0035fffea000, op 0035fffea000 } vmclear (XEN) [221256.802378] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221256.811298] **[03] { now c93b5784dd0a, hw , op 0039d7074000 } vmptrld (XEN) [221256.821478] ** frames [ 82d0801f4c31 vmx_do_resume+0x51/0x150 ] (XEN) [221256.829139] **[04] { now c93b59d67b5b, hw 0039d7074000, op 002b9a575000 } vmptrld (XEN) [221256.839320] ** frames [ 82d0801f4c31 vmx_do_resume+0x51/0x150 ] (XEN) [221256.882850] **[07] { now c93b59e71e48, hw 002b9a575000, op 0039d7074000 } vmptrld (XEN) [221256.893034] ** frames [ 82d0801f4d13 vmx_do_resume+0x133/0x150 ] (XEN) [221256.900790] **[08] { now c93b59e78675, hw 0039d7074000, op 0040077ae000 } vmptrld (XEN) [221256.910968] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221256.919015] **[09] { now c93b59e78ac8, hw 0040077ae000, op 0040077ae000 } vmclear (XEN) [221256.929196] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221256.938117] **[10] { now c93b59e78d72, hw , op 0040077ae000 } vmptrld (XEN) [221256.948297] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221256.956345] **[11] { now c93b59e78ff0, hw 0040077ae000, op 0040077ae000 } vmclear (XEN) [221256.966525] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221256.975445] **[12] { now c93b59e7deda, hw , op 0040077b3000 } vmptrld (XEN) [221256.985626] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221256.993672] **[13] { now c93b59e9fe00, hw 0040077b3000, op 0040077b3000 } vmclear (XEN) [221257.003852] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221257.012772] **[14] { now c93b59ea007e, hw , op 0040077b3000 } vmptrld (XEN) [221257.022952] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221257.031000] **[15] { now c93b59ea02ba, hw 0040077b3000, op 0040077b3000 } vmclear (XEN) [221257.041180] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221257.050101] (XEN) [221257.053008] vmcs_ptr:0x, vcpu->vmcs:0x2b9a575000 vmcs is loaded and between the next call to vm_write, there is a clear of vmcs caused by vmx_vcpu_disable_pml. Above log highlights that IPI is clearing the vmcs in between vmptrld and vmwrite but I also verified that interrupts are disabled during context switch and execution of vm_write in vmx_fpu_leave.. This has got me confused. Also, I am not sure if I understand the handling of foreign_vmcs correctly, which can also be the cause of the race. Please if you can share some suggestions here. Thanks Anshul Makkar ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ___ Xen-devel mailing list Xen-devel@lists.xen.org
[Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.
Hi, Facing a issue where bootstorm of guests leads to host crash. I debugged and found that that enabling PML introduces a race condition during guest teardown stage while disabling PML on a vcpu and context switch happening for another vcpu. Crash happens only on Broadwell processors as PML got introduced in this generation. Here is my analysis: Race condition: vmcs.c vmx_vcpu_disable_pml (vcpu){ vmx_vmcs_enter() ; vm_write( disable_PML); vmx_vmcx_exit();) In between I have a callpath from another pcpu executing context switch-> vmx_fpu_leave() and crash on vmwrite.. if ( !(v->arch.hvm_vmx.host_cr0 & X86_CR0_TS) ) { v->arch.hvm_vmx.host_cr0 |= X86_CR0_TS; __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); //crash } Debug logs XEN) [221256.749928] VMWRITE VMCS Invalid ! (XEN) [221256.754870] **[00] { now c93b4341df1d, hw 0035fffea000, op 0035fffea000 } vmclear (XEN) [221256.765052] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221256.773969] **[01] { now c93b4341e099, hw , op 0035fffea000 } vmptrld (XEN) [221256.784150] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221256.792197] **[02] { now c93b4341e1f1, hw 0035fffea000, op 0035fffea000 } vmclear (XEN) [221256.802378] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221256.811298] **[03] { now c93b5784dd0a, hw , op 0039d7074000 } vmptrld (XEN) [221256.821478] ** frames [ 82d0801f4c31 vmx_do_resume+0x51/0x150 ] (XEN) [221256.829139] **[04] { now c93b59d67b5b, hw 0039d7074000, op 002b9a575000 } vmptrld (XEN) [221256.839320] ** frames [ 82d0801f4c31 vmx_do_resume+0x51/0x150 ] (XEN) [221256.882850] **[07] { now c93b59e71e48, hw 002b9a575000, op 0039d7074000 } vmptrld (XEN) [221256.893034] ** frames [ 82d0801f4d13 vmx_do_resume+0x133/0x150 ] (XEN) [221256.900790] **[08] { now c93b59e78675, hw 0039d7074000, op 0040077ae000 } vmptrld (XEN) [221256.910968] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221256.919015] **[09] { now c93b59e78ac8, 
hw 0040077ae000, op 0040077ae000 } vmclear (XEN) [221256.929196] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221256.938117] **[10] { now c93b59e78d72, hw , op 0040077ae000 } vmptrld (XEN) [221256.948297] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221256.956345] **[11] { now c93b59e78ff0, hw 0040077ae000, op 0040077ae000 } vmclear (XEN) [221256.966525] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221256.975445] **[12] { now c93b59e7deda, hw , op 0040077b3000 } vmptrld (XEN) [221256.985626] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221256.993672] **[13] { now c93b59e9fe00, hw 0040077b3000, op 0040077b3000 } vmclear (XEN) [221257.003852] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221257.012772] **[14] { now c93b59ea007e, hw , op 0040077b3000 } vmptrld (XEN) [221257.022952] ** frames [ 82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ] (XEN) [221257.031000] **[15] { now c93b59ea02ba, hw 0040077b3000, op 0040077b3000 } vmclear (XEN) [221257.041180] ** frames [ 82d080134652 smp_call_function_interrupt+0x92/0xa0 ] (XEN) [221257.050101] (XEN) [221257.053008] vmcs_ptr:0x, vcpu->vmcs:0x2b9a575000 vmcs is loaded and between the next call to vm_write, there is a clear of vmcs caused by vmx_vcpu_disable_pml. Above log highlights that IPI is clearing the vmcs in between vmptrld and vmwrite but I also verified that interrupts are disabled during context switch and execution of vm_write in vmx_fpu_leave.. This has got me confused. Also, I am not sure if I understand the handling of foreign_vmcs correctly, which can also be the cause of the race. Please if you can share some suggestions here. Thanks Anshul Makkar ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86/VMX: Implement vmptrst
On 11/01/17 05:37, Tian, Kevin wrote: From: Anshul Makkar [mailto:anshul.mak...@citrix.com] Sent: Friday, January 06, 2017 2:42 AM Current codebase doesn't implement or use vmptrst. Implementing it as it may be required in future. Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> Then let's do it when it's really required. :-) I needed it to debug an issue, found it missing, and that's why I added it. I have a sanity test that continuously uses it. There may be other users out there who can use it for a similar or some other purpose. Having it in the standard code base will do no harm. Thanks Kevin Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH] x86/VMX: Implement vmptrst
Current codebase doesn't implement or use vmptrst. Implementing it as it may be required in future. Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- xen/include/asm-x86/hvm/vmx/vmx.h | 22 ++ 1 file changed, 22 insertions(+) diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h index e5c6499..2db6c1d 100644 --- a/xen/include/asm-x86/hvm/vmx/vmx.h +++ b/xen/include/asm-x86/hvm/vmx/vmx.h @@ -328,6 +328,28 @@ static always_inline void __vmptrld(u64 addr) : "memory"); } +static always_inline u64 __vmptrst(void) +{ +u64 paddr; + +asm volatile ( +#ifdef HAVE_GAS_VMX + "vmptrst %0\n" +#else + VMPTRST_OPCODE MODRM_EAX_07 +#endif + +#ifdef HAVE_GAS_VMX + : "=m" (paddr) + : +#else + : + : "a" (&paddr) +#endif + : "memory"); +return paddr; +} + static always_inline void __vmpclear(u64 addr) { asm volatile ( -- 2.7.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] Xen 4.8 + Linux 4.9 + Credit2 = can't bootup
On 05/01/2017 08:39, Dario Faggioli wrote: > On Thu, 2017-01-05 at 02:05 +, Andrew Cooper wrote: >> On 05/01/2017 01:52, Konrad Rzeszutek Wilk wrote: >>> It works just fine with credit1 (see further down the log) >>> but if I try credit2 it ends up hanging during bootup. >>> >>> I am going to naively assume it is due to how the vCPUs are >>> exposed (where they match the physical CPUs under credit1), >>> but under credit2 they are different. >>> >>> The dom0_max_vcpus does not seem to have any effect. When I remove >>> it >>> things are still being problematic. >>> >>> Help!? >> >> This matches the symptoms seen by XenServer when trying to stress >> 32vcpu >> guests under Credit2. Malcolm did find (based on interpreted iperf >> throughput graphs) that Credit2 did seem to preferentially schedule >> the >> lower-number vcpus, rather than scheduling them evenly. >> > To be fair (and just for the records, since the cause seems actually to > be something else), this was with an old version (at least two Xen > releases ago, IIRC, certainly not 4.8) and known to be buggy version of > Credit2. > > We have other tests and benchmarks, done on equally big machines which > proves the scheduler is 100% functional. > Yes, I have done extensive stress tests on XenServer (IIRC Xen 4.7) with Credit2 on 32 or more VCPU guests. Didn't see any hangups. Malcolm did find hangup and crash issues with 32VCPU guests, but the issue was highlighted and fixed. So, during my testing I specifically focused on this scenario and found no problem. > Regards, > Dario > Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] xsm: allow relevant permission during migrate and gpu-passthrough.
On 03/01/17 18:20, Daniel De Graaf wrote: On 12/19/2016 11:03 PM, Doug Goldstein wrote: On 12/19/16 10:02 AM, Doug Goldstein wrote: On 12/14/16 3:09 PM, Daniel De Graaf wrote: On 12/12/2016 09:00 AM, Anshul Makkar wrote: During guest migrate allow permission to prevent spurious page faults. Prevents these errors: d73: Non-privileged (73) attempt to map I/O space avc: denied { set_misc_info } for domid=0 target=11 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=domain GPU passthrough for hvm guest: avc: denied { send_irq } for domid=0 target=10 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=hvm Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> Acked-by: Daniel De Graaf <dgde...@tycho.nsa.gov> Daniel, Should this be backported to 4.8? Yes, I would consider this a candidate for backporting. FWIW, Daniel's email is bouncing. Anshul, do you want to test/confirm? I believe this is fixed now; my email server was changed while I was gone for the holiday and apparently the change was not tested properly. Please backport the patch to stable-4.8. I have tested it. Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] Future support of 5-level paging in Xen
On 08/12/16 23:40, Boris Ostrovsky wrote: On 12/08/2016 05:21 PM, Andrew Cooper wrote: On 08/12/2016 19:18, Stefano Stabellini wrote: Of course even the largest virtual machine today (2TB on Amazon AFAIK) is not close to reaching the current memory limit, but it's just a matter of time. /me thinks Oracle will have something to say about this. I'm sure there was talk about VMs larger than this at previous hackathons. XenServer functions (ish, so long as you don't migrate) with 6TB VMs, although starting and shutting them down feels like treacle. I've been working (on and off) with SGI to get one of their 32TB boxes to boot and I don't think that works. We've fixed a couple of bugs but I don't think Xen can boot with that much memory. We successfully booted with just under 8TB but couldn't do it with the full system. The machine has been taken from us for now so this work is on hold. This is on OVM, which is 4.4-based, we haven't tried (IIRC) latest bits. (BTW, speaking of slow starting and shutting down very large guests --- have you or anyone else had a chance to look at this? My investigation initially pointed to scrubbing and then to an insane number of hypercall preemptions in relinquish_memory()). I had a quick look at it when I was working on support for large guests and found that scrubbing was indeed one of the issues. I just haven't had time to look at it in more detail. Hopefully I can work on it in the near future. -boris ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] xsm: allow relevant permission during migrate and gpu-passthrough.
On 20/12/2016 04:03, Doug Goldstein wrote: On 12/19/16 10:02 AM, Doug Goldstein wrote: On 12/14/16 3:09 PM, Daniel De Graaf wrote: On 12/12/2016 09:00 AM, Anshul Makkar wrote: During guest migrate allow permission to prevent spurious page faults. Prevents these errors: d73: Non-privileged (73) attempt to map I/O space avc: denied { set_misc_info } for domid=0 target=11 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=domain GPU passthrough for hvm guest: avc: denied { send_irq } for domid=0 target=10 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=hvm Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> Acked-by: Daniel De Graaf <dgde...@tycho.nsa.gov> Daniel, Should this be backported to 4.8? FWIW, Daniel's email is bouncing. Anshul, do you want to test/confirm? Doug, yes, will backport and test. Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH] xsm: allow relevant permission during migrate and gpu-passthrough.
During guest migrate allow permission to prevent spurious page faults. Prevents these errors: d73: Non-privileged (73) attempt to map I/O space avc: denied { set_misc_info } for domid=0 target=11 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=domain GPU passthrough for hvm guest: avc: denied { send_irq } for domid=0 target=10 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=hvm Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- tools/flask/policy/modules/xen.if |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/flask/policy/modules/xen.if b/tools/flask/policy/modules/xen.if index eb646f5..1aca75d 100644 --- a/tools/flask/policy/modules/xen.if +++ b/tools/flask/policy/modules/xen.if @@ -49,7 +49,7 @@ define(`create_domain_common', ` allow $1 $2:domain { create max_vcpus setdomainmaxmem setaddrsize getdomaininfo hypercall setvcpucontext getscheduler getvcpuinfo getaddrsize getaffinity setaffinity - settime setdomainhandle getvcpucontext }; + settime setdomainhandle getvcpucontext set_misc_info }; allow $1 $2:domain2 { set_cpuid settsc setscheduler setclaim set_max_evtchn set_vnumainfo get_vnumainfo cacheflush psr_cmt_op psr_cat_op soft_reset }; @@ -58,7 +58,7 @@ define(`create_domain_common', ` allow $1 $2:mmu { map_read map_write adjust memorymap physmap pinpage mmuext_op updatemp }; allow $1 $2:grant setup; allow $1 $2:hvm { cacheattr getparam hvmctl irqlevel pciroute sethvmc - setparam pcilevel trackdirtyvram nested altp2mhvm altp2mhvm_op }; + setparam pcilevel trackdirtyvram nested altp2mhvm altp2mhvm_op send_irq }; ') # create_domain(priv, target) -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] xsm support for restrictive policy and dynamic modification
Hi, At the Xen Summit in Toronto, I gave a presentation on XSM in which two shortcomings of the present implementation were highlighted: 1) Allow finer-grained control, e.g. not every introspection domain needs the PCI-passthrough capability; a few need it, while all should have the introspection capability. 2) The need for dynamic modification, i.e. loading/unloading of the policy. In response to the above I got a few suggestions from the community (thanks Daniel De Graaf, Stephen Smalley and others), and after researching the alternatives I found that the already-existing XSM support for MCS, MLS and typebounds can provide the finer-grained control. Regarding dynamic support, it's something that is still desired and not implemented. Stephen shared a few links related to this: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.7129&rep=rep1&type=pdf http://oss.tresys.com/archive/policy-server.php Going forward, I will focus on the dynamic implementation, as I believe the currently existing infrastructure is sufficient to provide a finer level of control over domains. Thanks Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2 04/10] xen: credit2: only reset credit on reset condition
On 30/09/16 03:53, Dario Faggioli wrote: The condition for a Credit2 scheduling epoch coming to an end is that the vcpu at the front of the runqueue has negative credits. However, it is possible that runq_candidate() does not actually return to the scheduler the first vcpu in the runqueue (e.g., because such vcpu can't run on the cpu that is going through the scheduler, because of hard-affinity). If that happens, we should not trigger a credit reset, or we risk altering the length of a scheduler epoch, wrt what the original idea of the algorithm was. Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> --- Cc: George Dunlap <george.dun...@eu.citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> --- /* * If a vcpu is meant to be picked up by another processor, and such @@ -2282,6 +2288,7 @@ runq_candidate(struct csched2_runqueue_data *rqd, if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu && cpumask_test_cpu(svc->tickled_cpu, &rqd->tickled) ) { +(*skipped)++; SCHED_STAT_CRANK(deferred_to_tickled_cpu); continue; } @@ -2291,6 +2298,7 @@ runq_candidate(struct csched2_runqueue_data *rqd, if ( svc->vcpu->processor != cpu && snext->credit + CSCHED2_MIGRATE_RESIST > svc->credit ) { +(*skipped)++; SCHED_STAT_CRANK(migrate_resisted); continue; } @@ -2308,11 +2316,12 @@ runq_candidate(struct csched2_runqueue_data *rqd, { struct { unsigned vcpu:16, dom:16; -unsigned tickled_cpu; +unsigned tickled_cpu, skipped; } d; d.dom = snext->vcpu->domain->domain_id; d.vcpu = snext->vcpu->vcpu_id; d.tickled_cpu = snext->tickled_cpu; +d.skipped = *skipped; __trace_var(TRC_CSCHED2_RUNQ_CANDIDATE, 1, sizeof(d), (unsigned char *)&d); @@ -2336,6 +2345,7 @@ csched2_schedule( struct csched2_runqueue_data *rqd; struct csched2_vcpu * const scurr = CSCHED2_VCPU(current); struct csched2_vcpu *snext = NULL; +unsigned int skipped_vcpus = 0; struct task_slice ret; SCHED_STAT_CRANK(schedule); @@ -2385,7 +2395,7 @@ csched2_schedule( snext = CSCHED2_VCPU(idle_vcpu[cpu]); } else -snext = 
runq_candidate(rqd, scurr, cpu, now); +snext = runq_candidate(rqd, scurr, cpu, now, &skipped_vcpus); /* If switching from a non-idle runnable vcpu, put it * back on the runqueue. */ @@ -2409,8 +2419,21 @@ csched2_schedule( __set_bit(__CSFLAG_scheduled, &snext->flags); } -/* Check for the reset condition */ -if ( snext->credit <= CSCHED2_CREDIT_RESET ) +/* + * The reset condition is "has a scheduler epoch come to an end?". + * The way this is enforced is checking whether the vcpu at the top + * of the runqueue has negative credits. This means the epochs have + * variable length, as in one epoch expires when: + * 1) the vcpu at the top of the runqueue has executed for + * around 10 ms (with default parameters); + * 2) no other vcpu with higher credits wants to run. + * + * Here, where we want to check for reset, we need to make sure the + * proper vcpu is being used. In fact, runqueue_candidate() may have + * not returned the first vcpu in the runqueue, for various reasons + * (e.g., affinity). Only trigger a reset when it does. + */ +if ( skipped_vcpus == 0 && snext->credit <= CSCHED2_CREDIT_RESET ) { Is there no other way to check that the vcpu returned from runq_candidate() is the first one, apart from the extra "skipped" variable? It looks a bit ugly. reset_credit(ops, cpu, now, snext); balance_load(ops, cpu, now); ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 17/24] xen: credit2: soft-affinity awareness in runq_tickle()
On 05/09/16 15:55, Dario Faggioli wrote: On Thu, 2016-09-01 at 11:52 +0100, anshul makkar wrote: On 17/08/16 18:19, Dario Faggioli wrote: +/* + * We're doing soft-affinity, and we know that the current vcpu on cpu + * has a soft affinity. We now want to know whether cpu itself is in Please can you explain the above statement. If the vcpu has soft affinity and it's currently executing, doesn't it always mean that it's running on one of the pcpus in its soft affinity or hard affinity? A vcpu will always run on a pcpu from its own hard-affinity (that's the definition of hard-affinity). On the other hand, a vcpu will, most of the time, run on a cpu from its own soft affinity, but can run on a cpu that is in its hard-affinity, but *IS NOT* in its soft-affinity. That's the definition of soft-affinity: the scheduler will try to run it there, but if that can't happen, it will run it outside of it (but still within its hard-affinity, of course). So, yes, we know already that it's running on a cpu at least from its hard affinity; what is it exactly that you are not understanding? If I put it simply: can (X being a vcpu) X{soft affinity pcpus} intersect X{hard affinity pcpus} be a null or disjoint set? And can X{runnable pcpus} intersect (X{hard affinity pcpus} union X{soft affinity pcpus}) be null or disjoint? + * such affinity. In fact, since we now that new (in runq_tickle()) is Typo: * such affinity. In fact, since now we know that new (in runq_tickle()) is Thanks. :-) + */ Regards, Dario Anshul Makkar ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 18/24] xen: credit2: soft-affinity awareness fallback_cpu() and cpu_pick()
On 05/09/16 14:26, Dario Faggioli wrote: On Thu, 2016-09-01 at 12:08 +0100, anshul makkar wrote: On 17/08/16 18:19, Dario Faggioli wrote: Can't we just read their workload or we can change the locktype to allow reading ? Reading without taking the lock would race against the load value being updated. And updating the load is done by __update_runq_load(), which, with all its shifts and maths, by no means is an atomic operation. So it's not just a matter of risking to read a slightly outdated value, which, I agree, may not be that bad; it's that we risk reading something inconsistent. :-/ Ok. Got it and agree. About "changing the locktype", I guess you mean that we can turn also the runqueue lock into an rw-lock? If yes, that's indeed interesting, and I've also thought about it, but, for now, always deferred trying to actually do that. Yes. It's technically non trivial, as it would involve changing schedule_data->schedule_lock and all the {v,p}cpu_schedule_lock*() functions. Also, it's a lock that will almost all the times be taken for writing, which usually means what you want is a proper spinlock. So, IMO, before embarking in doing something like that, it'd be good to figure out how frequently we actually fail to take the remote runqueue lock, and what's the real impact of having to deal with the consequences of that. Ok. Let's discuss that to finalize the approach. I'm not saying it's not worth a try, but I'm saying that it's something at high risk of being a lot of work for a very small gain, and that there are more important things to focus on. + * we pick, in order of decreasing preference: + * 1) svc's current pcpu, if it is part of svc's soft affinity; + * 2) a pcpu in svc's current runqueue that is also in svc's soft affinity; svc's current runqueue. Do you mean the runq in which svc is currently queued ? 
I mean the runqueue to which svc is currently assigned (again, as per runq_assign()/runq_deassign()), which in turn means that, if svc is queued in a runqueue, it's queued there (so, I guess the short answer to your question is "yes" :-D). Ok. Regards, Dario Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
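The point about racing against `__update_runq_load()` can be illustrated with a minimal trylock sketch. Names are hypothetical, and a plain `locked` flag stands in for the per-runqueue spinlock:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy runqueue: a contention flag standing in for the runqueue lock. */
struct toy_runq {
    bool locked;
    long load;
};

/*
 * Read a remote runqueue's load only if we can take its lock; on
 * contention, skip the runqueue entirely (as csched2_cpu_pick() does).
 * Reading the load mid-update, without the lock, could observe a
 * half-updated, inconsistent value, not merely a stale one.
 */
static bool try_read_load(struct toy_runq *rq, long *out)
{
    if (rq->locked)
        return false;      /* trylock failed: don't spin, just skip */
    rq->locked = true;
    *out = rq->load;       /* consistent snapshot under the lock */
    rq->locked = false;
    return true;
}
```

An rw-lock would let concurrent readers in, but as noted above, a lock taken almost exclusively for writing usually wants to stay a spinlock.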
Re: [Xen-devel] [PATCH 05/24] xen: credit2: make tickling more deterministic
On 05/09/16 14:47, Dario Faggioli wrote: On Wed, 2016-08-31 at 18:10 +0100, anshul makkar wrote: On 17/08/16 18:18, Dario Faggioli wrote: @@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd) ASSERT(svc->sdom != NULL); svc->credit = CSCHED2_CREDIT_INIT; svc->weight = svc->sdom->weight; +svc->tickled_cpu = -1; /* Starting load of 50% */ svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1); svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT; @@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd) else { ASSERT(svc->sdom == NULL); +svc->tickled_cpu = svc->vcpu->vcpu_id; If I understood correctly, tickled_cpu refers to a pcpu and not a vcpu. Saving vcpu_id in tickled_cpu looks wrong. Yes, and in fact, as you can see in the previous hunk, for pretty much all vcpus, tickled_cpu is initialized to -1. Here, we are dealing with the vcpus of the idle domain. And for vcpus of the idle domain, their vcpu id is the same as the id of the pcpu they're associated to. Ah, that makes it clear. I agree it looks a little bit weird, but it's both correct, and the easiest and cleanest way of initializing this. I guess I can add a comment... That will be useful. Thanks and Regards, Dario Thanks Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
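The initialization rule being discussed ("everyone starts with no tickled cpu, except idle vcpus") boils down to something like this sketch. The helper is hypothetical; it relies on the invariant Dario states, that an idle-domain vcpu's id equals the id of its pcpu:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vcpu {
    int vcpu_id;
    bool is_idle;
};

/*
 * tickled_cpu holds the pcpu that was tickled to pick this vcpu up
 * (-1 if none). For the idle domain, vcpu_id == pcpu id, so the vcpu
 * id doubles as a valid pcpu number; every other vcpu starts at -1.
 */
static int initial_tickled_cpu(const struct toy_vcpu *v)
{
    return v->is_idle ? v->vcpu_id : -1;
}
```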
Re: [Xen-devel] [PATCH 23/24] xen: credit2: optimize runq_tickle() a little bit
On 17/08/16 18:20, Dario Faggioli wrote: By not looking at the same cpu (to check whether we want to preempt who's running there) twice, if the vcpu being woken up has both soft and hard affinity. In fact, all the cpus that are part of both soft affinity and hard-affinity (of the waking vcpu) are checked during the soft-affinity balancing step. If none turns out to be suitable, e.g., because they're running vcpus with higher credits, there's no point in re-checking them, only to re-assess the same, during the hard-affinity step. Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> --- Cc: George Dunlap <george.dun...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> --- xen/common/sched_credit2.c | 43 +++ 1 file changed, 39 insertions(+), 4 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 6963872..f03ecce 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -997,7 +997,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) s_time_t lowest = (1<<30); unsigned int bs, cpu = new->vcpu->processor; struct csched2_runqueue_data *rqd = RQD(ops, cpu); -cpumask_t mask; +cpumask_t mask, skip_mask; struct csched2_vcpu * cur; ASSERT(new->rqd == rqd); @@ -1017,6 +1017,13 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) (unsigned char *)&d); } +/* + * Cpus that end up in this mask have already been checked during the + * soft-affinity step, and need not be checked again when doing hard + * affinity. 
+ */ +cpumask_clear(&skip_mask); + for_each_affinity_balance_step( bs ) { /* @@ -1073,7 +1080,8 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) cpumask_andnot(&mask, &rqd->active, &rqd->idle); cpumask_andnot(&mask, &mask, &rqd->tickled); cpumask_and(&mask, &mask, cpumask_scratch); -if ( cpumask_test_cpu(cpu, &mask) ) +if ( cpumask_test_cpu(cpu, &mask) && + !cpumask_test_cpu(cpu, &skip_mask) ) { cur = CSCHED2_VCPU(curr_on_cpu(cpu)); @@ -1102,13 +1110,26 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) ipid = cpu; goto tickle; } + +/* + * If we're here, cpu is just not a valid candidate for being + * tickled. Set its bit in skip_mask, to avoid calling + * burn_credits() and check its current vcpu for preemption + * twice. + */ +__cpumask_set_cpu(cpu, &skip_mask); } } for_each_cpu(i, &mask) { -/* Already looked at this one above */ -if ( i == cpu ) +/* + * Already looked at these ones above, either because it's the + * cpu where new was running before, or because we are at the + * hard-affinity step, and we checked this during the + * soft-affinity one + */ Sorry for my naiveness here, but can we be sure that the situation has not changed since we checked during the soft-affinity step? +if ( i == cpu || cpumask_test_cpu(i, &skip_mask) ) continue; cur = CSCHED2_VCPU(curr_on_cpu(i)); @@ -1139,6 +1160,20 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) ipid = i; lowest = cur->credit; } + +/* + * No matter if i is the new lowest or not. We've run + * burn_credits() on it, and we've checked it for preemption. + * + * If we are at soft-affinity balancing step, and i is indeed + * the lowest, it will be tickled (and we exit the function). + * If it is not the lowest among the cpus in the soft-affinity + * mask, it can't be the lowest among the cpus in the hard + * affinity mask (assuming we'll actually do the second + * balancing step), as hard-affinity is a superset of soft + * affinity, and therefore we can flag it to be skipped. 
+ */ +__cpumask_set_cpu(i, &skip_mask); } } ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
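To see why the skip mask saves work even though hard affinity is a superset of soft affinity, here is a toy two-pass walk over cpu bitmaps (illustrative only; Xen uses `cpumask_t`, not a `uint64_t`, and the per-cpu "check" in the real code is burn_credits() plus the preemption test):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count how many per-cpu assessments the two balance steps perform.
 * Cpus assessed during the soft-affinity pass are recorded in a skip
 * mask, so the hard-affinity pass does not re-assess them: both passes
 * run back to back within one runq_tickle() invocation, so the verdict
 * for those cpus cannot have improved in between.
 */
static int checks_performed(uint64_t soft, uint64_t hard, int ncpus)
{
    uint64_t skip = 0;
    int checked = 0;
    int i;

    for (i = 0; i < ncpus; i++)          /* soft-affinity step */
        if ((soft >> i) & 1) {
            checked++;
            skip |= 1ULL << i;
        }
    for (i = 0; i < ncpus; i++)          /* hard-affinity step */
        if (((hard >> i) & 1) && !((skip >> i) & 1))
            checked++;
    return checked;
}
```

With soft = {0,1} and hard = {0,1,2,3}, only 4 checks happen instead of 6. This also bears on Anshul's question above: the optimization only claims nothing changes *within* a single invocation, not across invocations.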
Re: [Xen-devel] [PATCH 19/24] xen: credit2: soft-affinity awareness in load balancing
On 17/08/16 18:19, Dario Faggioli wrote: We want soft-affinity to play a role in load balancing, i.e., when deciding whether or not to something like that at some point. (Oh, and while there, just a couple of style fixes are also done.) Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> --- Cc: George Dunlap <george.dun...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> --- xen/common/sched_credit2.c | 359 1 file changed, 326 insertions(+), 33 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 2d7228a..3722f46 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -1786,19 +1786,21 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc) return new_cpu; } -/* Working state of the load-balancing algorithm */ +/* Working state of the load-balancing algorithm. */ typedef struct { -/* NB: Modified by consider() */ +/* NB: Modified by consider(). */ s_time_t load_delta; struct csched2_vcpu * best_push_svc, *best_pull_svc; -/* NB: Read by consider() */ +/* NB: Read by consider() (and the various consider_foo() functions). */ struct csched2_runqueue_data *lrqd; -struct csched2_runqueue_data *orqd; +struct csched2_runqueue_data *orqd; +bool_t push_has_soft_aff, pull_has_soft_aff; +s_time_t push_soft_aff_load, pull_soft_aff_load; } balance_state_t; -static void consider(balance_state_t *st, - struct csched2_vcpu *push_svc, - struct csched2_vcpu *pull_svc) +static inline s_time_t consider_load(balance_state_t *st, + struct csched2_vcpu *push_svc, + struct csched2_vcpu *pull_svc) { s_time_t l_load, o_load, delta; @@ -1821,11 +1823,166 @@ static void consider(balance_state_t *st, if ( delta < 0 ) delta = -delta; +return delta; +} + +/* + * Load balancing and soft-affinity. + * + * When trying to figure out whether or not it's best to move a vcpu from + * one runqueue to another, we must keep soft-affinity in mind. 
Intuitively + * we would want to know the following: + * - 'how much' affinity does the vcpu have with its current runq? + * - 'how much' affinity will it have with its new runq? + * + * But we certainly need to be more precise about how much it is that 'how + * much'! Let's start with some definitions: + * + * - let v be a vcpu, running in runq I, with soft-affinity to vi + *pcpus of runq I, and soft affinity with vj pcpus of runq J; + * - let k be another vcpu, running in runq J, with soft-affinity to kj + *pcpus of runq J, and with ki pcpus of runq I; + * - let runq I have Ci pcpus, and runq J Cj pcpus; + * - let vcpu v have an average load of lv, and k an average load of lk; + * - let runq I have an average load of Li, and J an average load of Lj. + * + * We also define the following:: + * + * - lvi = lv * (vi / Ci) as the 'perceived load' of v, when running + * in runq i; + * - lvj = lv * (vj / Cj) as the 'perceived load' of v, it running + * in runq j; + * - the same for k, mutatis mutandis. + * + * Idea is that vi/Ci (i.e., the ratio of the number of cpus of a runq that + * a vcpu has soft-affinity with, over the total number of cpus of the runq + * itself) can be seen as the 'degree of soft-affinity' of v to runq I (and + * vj/Cj the one of v to J). In other words, we define the degree of soft + * affinity of a vcpu to a runq as what fraction of pcpus of the runq itself + * the vcpu has soft-affinity with. Then, we multiply this 'degree of + * soft-affinity' by the vcpu load, and call the result the 'perceived load'. + * + * Basically, if a soft-affinity is defined, the work done by a vcpu on a + * runq to which it has higher degree of soft-affinity, is considered + * 'lighter' than the same work done by the same vcpu on a runq to which it + * has smaller degree of soft-affinity (degree of soft affinity is <= 1). 
In + * fact, if soft-affinity is used to achieve NUMA-aware scheduling, the higher + * the degree of soft-affinity of the vcpu to a runq, the greater the probability + * of accessing local memory, when running on such runq. And that is certainly + * 'lighter' than having to fetch memory from remote NUMA nodes. Do we ensure that, while defining soft-affinity for a vcpu, the NUMA architecture is considered? If not, then this whole calculation can go wrong and have a negative impact on performance. Degree of affinity to a runq will give good results only if the affinity to pcpus has been chosen after due consideration. + * + * So, evaluating pushing v from I to J would mean removing (from I) a + * perceived load of lv*(vi/Ci) and adding (to J) a perceived load of + * lv*(vj/Cj), which we (looking at things from the point of view of I, + * which is what balan
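The 'perceived load' definition discussed in the comment above can be restated as a one-liner. This is a sketch with plain integer math; Xen's real load values are fixed-point with a configurable precision shift:

```c
#include <assert.h>

/*
 * Perceived load of a vcpu with average load `lv` on a runqueue that
 * has `runq_cpus` pcpus, of which `aff_cpus` are in the vcpu's soft
 * affinity: lv * (vi / Ci). The higher the degree of soft-affinity
 * (aff_cpus / runq_cpus, always <= 1), the "lighter" the same amount
 * of work is considered on that runqueue.
 */
static long perceived_load(long lv, int aff_cpus, int runq_cpus)
{
    return lv * aff_cpus / runq_cpus;
}
```

So a vcpu with load 1000 and soft-affinity to 2 of a runqueue's 4 pcpus contributes a perceived load of 500 there, and the full 1000 on a runqueue entirely inside its soft affinity.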
Re: [Xen-devel] [PATCH 18/24] xen: credit2: soft-affinity awareness fallback_cpu() and cpu_pick()
On 17/08/16 18:19, Dario Faggioli wrote: For get_fallback_cpu(), by putting in place the "usual" two steps (soft affinity step and hard affinity step) loop. We just move the core logic of the function inside the body of the loop itself. For csched2_cpu_pick(), what is important is to find the runqueue with the least average load. Currently, we do that by looping on all runqueues and checking, well, their load. For soft affinity, we want to know which one is the runqueue with the least load, among the ones where the vcpu would prefer to be assigned. We find both the least loaded runqueue among the soft affinity "friendly" ones, and the overall least loaded one, in the same pass. (Also, kill a spurious ';' when defining MAX_LOAD.) Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> Signed-off-by: Justin T. Weaver <jtwea...@hawaii.edu> --- Cc: George Dunlap <george.dun...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> --- xen/common/sched_credit2.c | 136 1 file changed, 111 insertions(+), 25 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 3aef1b4..2d7228a 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -506,34 +506,68 @@ void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask) } /* - * When a hard affinity change occurs, we may not be able to check some - * (any!) of the other runqueues, when looking for the best new processor - * for svc (as trylock-s in csched2_cpu_pick() can fail). If that happens, we - * pick, in order of decreasing preference: - * - svc's current pcpu; - * - another pcpu from svc's current runq; - * - any cpu. + * In csched2_cpu_pick(), it may not be possible to actually look at remote + * runqueues (the trylock-s on their spinlocks can fail!). If that happens, With remote runqueues , do you mean runqs on remote socket ? Can't we just read their workload or we can change the locktype to allow reading ? 
+ * we pick, in order of decreasing preference: + * 1) svc's current pcpu, if it is part of svc's soft affinity; + * 2) a pcpu in svc's current runqueue that is also in svc's soft affinity; svc's current runqueue. Do you mean the runq in which svc is currently queued ? + * 3) just one valid pcpu from svc's soft affinity; + * 4) svc's current pcpu, if it is part of svc's hard affinity; + * 5) a pcpu in svc's current runqueue that is also in svc's hard affinity; + * 6) just one valid pcpu from svc's hard affinity + * + * Of course, 1, 2 and 3 makes sense only if svc has a soft affinity. Also + * note that at least 6 is guaranteed to _always_ return at least one pcpu. */ static int get_fallback_cpu(struct csched2_vcpu *svc) { int cpu; +unsigned int bs; -if ( likely(cpumask_test_cpu(svc->vcpu->processor, - svc->vcpu->cpu_hard_affinity)) ) -return svc->vcpu->processor; +for_each_affinity_balance_step( bs ) +{ +if ( bs == BALANCE_SOFT_AFFINITY && + !has_soft_affinity(svc->vcpu, svc->vcpu->cpu_hard_affinity) ) +continue; Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
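The six-step preference order in the comment reduces to the usual two-step balance loop. A simplified sketch, using bitmaps instead of `cpumask_t` and collapsing steps 2–3 and 5–6 into "any pcpu in the mask" (the real code also intersects with the current runqueue's cpus first):

```c
#include <assert.h>
#include <stdint.h>

enum { BALANCE_SOFT_AFFINITY, BALANCE_HARD_AFFINITY, BALANCE_STEPS };

/*
 * Toy get_fallback_cpu(): per balance step, prefer the current pcpu if
 * it is in that step's affinity mask, else pick any pcpu from the mask
 * (lowest set bit here). The soft step is skipped when no soft
 * affinity is set; the hard mask is assumed non-empty, so a cpu is
 * always found.
 */
static int fallback_cpu(uint64_t soft, uint64_t hard, int cur)
{
    int bs;

    for (bs = 0; bs < BALANCE_STEPS; bs++) {
        uint64_t mask = (bs == BALANCE_SOFT_AFFINITY) ? soft : hard;

        if (bs == BALANCE_SOFT_AFFINITY && mask == 0)
            continue;                        /* no soft affinity */
        if ((mask >> cur) & 1)
            return cur;                      /* stay where we are */
        if (mask)
            return __builtin_ctzll(mask);    /* any valid pcpu */
    }
    return -1;                               /* unreachable if hard != 0 */
}
```

(`__builtin_ctzll` is a GCC/Clang builtin, standing in for Xen's cpumask iteration helpers.)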
Re: [Xen-devel] [PATCH 17/24] xen: credit2: soft-affinity awareness in runq_tickle()
On 17/08/16 18:19, Dario Faggioli wrote: This is done by means of the "usual" two steps loop: - soft affinity balance step; - hard affinity balance step. The entire logic implemented in runq_tickle() is applied, during the first step, considering only the CPUs in the vcpu's soft affinity. In the second step, we fall back to use all the CPUs from its hard affinity (as it is doing now, without this patch). Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> Signed-off-by: Justin T. Weaver <jtwea...@hawaii.edu> --- Cc: George Dunlap <george.dun...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> --- xen/common/sched_credit2.c | 243 1 file changed, 157 insertions(+), 86 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 0d83bd7..3aef1b4 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -902,6 +902,42 @@ __runq_remove(struct csched2_vcpu *svc) list_del_init(&svc->runq_elem); } +/* + * During the soft-affinity step, only actually preempt someone if + * he does not have soft-affinity with cpu (while we have). + * + * BEWARE that this uses cpumask_scratch, trowing away what's in there! Typo:* BEWARE that this uses cpumask_scratch, throwing away what's in there! + */ +static inline bool_t soft_aff_check_preempt(unsigned int bs, unsigned int cpu) +{ +struct csched2_vcpu * cur = CSCHED2_VCPU(curr_on_cpu(cpu)); + +/* + * If we're doing hard-affinity, always check whether to preempt cur. + * If we're doing soft-affinity, but cur doesn't have one, check as well. + */ +if ( bs == BALANCE_HARD_AFFINITY || + !has_soft_affinity(cur->vcpu, cur->vcpu->cpu_hard_affinity) ) +return 1; + +/* + * We're doing soft-affinity, and we know that the current vcpu on cpu + * has a soft affinity. We now want to know whether cpu itself is in Please can you explain the above statement. 
If the vcpu has soft affinity and it's currently executing, doesn't it always mean that it's running on one of the pcpus in its soft affinity or hard affinity? + * such affinity. In fact, since we now that new (in runq_tickle()) is Typo: * such affinity. In fact, since now we know that new (in runq_tickle()) is + * - if cpu is not in cur's soft-affinity, we should indeed check to + *see whether new should preempt cur. If that will be the case, that + *would be an improvement wrt respecting soft affinity; + * - if cpu is in cur's soft-affinity, we leave it alone and (in + *runq_tickle()) move on to another cpu. In fact, we don't want to + *be too harsh with someone which is running within its soft-affinity. + *This is safe because later, if we don't find anyone else during the + *soft-affinity step, we will check cpu for preemption anyway, when + *doing hard-affinity. + */ +affinity_balance_cpumask(cur->vcpu, BALANCE_SOFT_AFFINITY, cpumask_scratch); +return !cpumask_test_cpu(cpu, cpumask_scratch); +} + void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *, s_time_t); /* @@ -925,7 +961,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) { int i, ipid = -1; s_time_t lowest = (1<<30); -unsigned int cpu = new->vcpu->processor; +unsigned int bs, cpu = new->vcpu->processor; struct csched2_runqueue_data *rqd = RQD(ops, cpu); cpumask_t mask; struct csched2_vcpu * cur; @@ -947,109 +983,144 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) (unsigned char *)&d); } -/* - * First of all, consider idle cpus, checking if we can just - * re-use the pcpu where we were running before. - * - * If there are cores where all the siblings are idle, consider - * them first, honoring whatever the spreading-vs-consolidation - * SMT policy wants us to do. 
- */ -if ( unlikely(sched_smt_power_savings) ) -cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle); -else -cpumask_copy(&mask, &rqd->smt_idle); -cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity); -i = cpumask_test_or_cycle(cpu, &mask); -if ( i < nr_cpu_ids ) +for_each_affinity_balance_step( bs ) { -SCHED_STAT_CRANK(tickled_idle_cpu); -ipid = i; -goto tickle; -} +/* + * First things first: if we are at the first (soft affinity) step, + * but new doesn't have a soft affinity, skip this step. + */ +if ( bs == BALANCE_SOFT_AFFINITY && + !has_soft_affinity(new->vcpu, new->vcpu->cpu_hard_affinity) ) +continue; -/* - * If there are no fully idle cores, check all idlers, after - * having filtered out pcpus that have been tickled but haven't -
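Anshul's question about the preemption check is easier to see in a stripped-down version of `soft_aff_check_preempt()`. Toy bitmaps and hypothetical names, mirroring the patch's logic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * During the soft-affinity step, only consider preempting the vcpu
 * currently running on `cpu` if that vcpu is itself running outside
 * its own soft affinity (or has none). A vcpu running inside its soft
 * affinity is left alone for now; it may still be preempted later,
 * during the hard-affinity step, if nobody better was found.
 */
static bool may_preempt(bool hard_step, uint64_t cur_soft_aff, int cpu)
{
    if (hard_step || cur_soft_aff == 0)
        return true;
    return !((cur_soft_aff >> cpu) & 1);
}
```

So the tricky case is exactly the one asked about: the running vcpu *is* inside one of its preferred pcpus, and the soft-affinity step deliberately declines to disturb it.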
Re: [Xen-devel] [PATCH 05/24] xen: credit2: make tickling more deterministic
On 17/08/16 18:18, Dario Faggioli wrote: Right now, the following scenario can occurr: - upon vcpu v wakeup, v itself is put in the runqueue, and pcpu X is tickled; - pcpu Y schedules (for whatever reason), sees v in the runqueue and picks it up. This may seem ok (or even a good thing), but it's not. In fact, if runq_tickle() decided X is where v should run, it did it for a reason (load distribution, SMT support, cache hotness, affinity, etc), and we really should try as hard as possible to stick to that. Of course, we can't be too strict, or we risk leaving vcpus in the runqueue while there is available CPU capacity. So, we only leave v in runqueue --for X to pick it up-- if we see that X has been tickled and has not scheduled yet, i.e., it will have a real chance of actually select and schedule v. If that is not the case, we schedule it on Y (or, at least, we consider that), as running somewhere non-ideal is better than not running at all. The commit also adds performance counters for each of the possible situations. Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> --- Cc: George Dunlap <george.dun...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> Cc: Jan Beulich <jbeul...@suse.com> Cc: Andrew Cooper <andrew.coop...@citrix.com> --- xen/common/sched_credit2.c | 65 +++--- xen/include/xen/perfc_defn.h |3 ++ 2 files changed, 64 insertions(+), 4 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 12dfd20..a3d7beb 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -54,6 +54,7 @@ #define TRC_CSCHED2_LOAD_CHECK TRC_SCHED_CLASS_EVT(CSCHED2, 16) #define TRC_CSCHED2_LOAD_BALANCE TRC_SCHED_CLASS_EVT(CSCHED2, 17) #define TRC_CSCHED2_PICKED_CPU TRC_SCHED_CLASS_EVT(CSCHED2, 19) +#define TRC_CSCHED2_RUNQ_CANDIDATE TRC_SCHED_CLASS_EVT(CSCHED2, 20) /* * WARNING: This is still in an experimental phase. 
Status and work can be found at the @@ -398,6 +399,7 @@ struct csched2_vcpu { int credit; s_time_t start_time; /* When we were scheduled (used for credit) */ unsigned flags; /* 16 bits doesn't seem to play well with clear_bit() */ +int tickled_cpu; /* cpu tickled for picking us up (-1 if none) */ /* Individual contribution to load */ s_time_t load_last_update; /* Last time average was updated */ @@ -1049,6 +1051,10 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) __cpumask_set_cpu(ipid, &rqd->tickled); smt_idle_mask_clear(ipid, &rqd->smt_idle); cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ); + +if ( unlikely(new->tickled_cpu != -1) ) +SCHED_STAT_CRANK(tickled_cpu_overwritten); +new->tickled_cpu = ipid; } /* @@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd) ASSERT(svc->sdom != NULL); svc->credit = CSCHED2_CREDIT_INIT; svc->weight = svc->sdom->weight; +svc->tickled_cpu = -1; /* Starting load of 50% */ svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1); svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT; @@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd) else { ASSERT(svc->sdom == NULL); +svc->tickled_cpu = svc->vcpu->vcpu_id; If I understood correctly, tickled_cpu refers to a pcpu and not a vcpu. Saving vcpu_id in tickled_cpu looks wrong. svc->credit = CSCHED2_IDLE_CREDIT; svc->weight = 0; } @@ -2233,7 +2241,8 @@ void __dump_execstate(void *unused); static struct csched2_vcpu * runq_candidate(struct csched2_runqueue_data *rqd, struct csched2_vcpu *scurr, - int cpu, s_time_t now) + int cpu, s_time_t now, + unsigned int *pos) { struct list_head *iter; struct csched2_vcpu *snext = NULL; @@ -2262,13 +2271,29 @@ runq_candidate(struct csched2_runqueue_data *rqd, /* Only consider vcpus that are allowed to run on this processor. 
*/ if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) ) +{ +(*pos)++; continue; +} + +/* + * If a vcpu is meant to be picked up by another processor, and such + * processor has not scheduled yet, leave it in the runqueue for him. + */ +if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu && + cpumask_test_cpu(svc->tickled_cpu, &rqd->tickled) ) +{ +(*pos)++; +SCHED_STAT_CRANK(deferred_to_tickled_cpu); +continue; +} /* If this is on a different processor, don't pull it unless
Re: [Xen-devel] [PATCH 4/4] tools/xenalyze: Allow automatic resizing of sample buffers
On 08/08/16 10:54, George Dunlap wrote: Rather than have large fixed-size buffers, start with smaller buffers and allow them to grow as needed (doubling each time), with a fairly large maximum. Allow this maximum to be set by a command-line parameter. Signed-off-by: George Dunlap <george.dun...@citrix.com> --- CC: Ian Jackson <ian.jack...@citrix.com> CC: Wei Liu <wei.l...@citrix.com> CC: Dario Faggioli <dario.faggi...@citrix.com> CC: Anshul Makkar <anshul.mak...@citrix.com> --- tools/xentrace/xenalyze.c | 95 +-- 1 file changed, 68 insertions(+), 27 deletions(-) diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c index 455cbdf..a4d8823 100644 --- a/tools/xentrace/xenalyze.c +++ b/tools/xentrace/xenalyze.c @@ -44,7 +44,8 @@ struct mread_ctrl; #define QHZ_FROM_HZ(_hz) (((_hz) << 10)/ 10) #define ADDR_SPACE_BITS 48 -#define DEFAULT_SAMPLE_SIZE 10240 +#define DEFAULT_SAMPLE_SIZE 1024 +#define DEFAULT_SAMPLE_MAX 1024*1024*32 #define DEFAULT_INTERVAL_LENGTH 1000 s->event_count++; if (!c) return; if(opt.sample_size) { -int lap = (s->count/opt.sample_size)+1, -index =s->count % opt.sample_size; -if((index - (lap/3))%lap == 0) { -if(!s->sample) { -s->sample = malloc(sizeof(*s->sample) * opt.sample_size); -if(!s->sample) { -fprintf(stderr, "%s: malloc failed!\n", __func__); -error(ERR_SYSTEM, NULL); -} +if (s->count >= s->sample_size +&& (s->count == 0 +|| opt.sample_max == 0 +|| s->sample_size < opt.sample_max)) { +int new_size; +void * new_sample = NULL; + +new_size = s->sample_size << 1; Sorry for my ignorance here, but why we have chosen to double the size. Can't we increase by fixed size X where X < double size. Are we sure that we will be able to fully utilize the double sized buffer. 
+ +if (new_size == 0) +new_size = opt.sample_size; + +if (opt.sample_max != 0 && new_size > opt.sample_max) +new_size = opt.sample_max; + +//printf("New size: %d\n", new_size); + +new_sample = realloc(s->sample, sizeof(*s->sample) * new_size); + +if (new_sample) { +s->sample = new_sample; +s->sample_size = new_size; } -s->sample[index]=c; } + +if (s->count < s->sample_size) { +s->sample[s->count]=c; +} else { +/* + * If we run out of space for samples, start taking only a + * subset of samples. + */ +int lap, index; +lap = (s->count/s->sample_size)+1; +index =s->count % s->sample_size; +if((index - (lap/3))%lap == 0) { +s->sample[index]=c; + } + } } s->count++; ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
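On the "why double?" question raised above: geometric growth keeps the total copying cost across reallocations linear in the number of samples, while a fixed increment makes it quadratic, and at most half the buffer is ever unused slack. A sketch of the growth policy as a hypothetical helper mirroring the patch's logic:

```c
#include <assert.h>

/*
 * Next buffer size: start at `initial`, double on each growth, and cap
 * at `max` (0 means uncapped). Doubling gives O(n) amortized copying
 * over n appends; growing by a fixed step X instead would copy
 * O(n^2 / X) elements in total, which dominates for large traces.
 */
static int next_size(int cur, int initial, int max)
{
    int ns = (cur == 0) ? initial : cur << 1;

    if (max != 0 && ns > max)
        ns = max;
    return ns;
}
```

Once the cap is hit, the patch falls back to keeping only a subset of samples, as the hunk above shows.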
Re: [Xen-devel] [PATCH -v3 1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue o
On 03/08/16 11:16, George Dunlap wrote: On 26/07/16 17:21, Dario Faggioli wrote: On Mon, 2016-07-25 at 17:12 +0100, Anshul Makkar wrote: It introduces context-switch rate-limiting. The patch enables the VM The subject, which will become the "title" of the commit, is way too long. That must be a very concise headline of what the patch is about, and should also have tags, specifying what areas of the codebase are affected. So what do you think of this: xen: credit2: implement context switch rate-limiting. It looks like it's just missing a carriage return -- I could fix that up on check-in. On a more technical side, I think that... +if ( prv->ratelimit_us ) +{ +s_time_t ratelimit_min = prv->ratelimit_us; ... this should be: s_time_t ratelimit_min = MICROSECS(prv->ratelimit_us); I realized that by looking at traces and seeing entries for which CSCHED2_MIN_TIMER was being returned, even if I had set sched_ratelimit_us to a value greater than that. Yes, Dario is correct here. There's also a small typo in one of the comments ("onw" instead of "own"). With all those changed: Reviewed-by: George Dunlap <george.dun...@citrix.com> If Anshul and Dario are happy, I can fix all those things up on commit. Fine with me. -George Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH -v3 1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue or be
It introduces context-switch rate-limiting. The patch enables the VM to batch its work and prevents the system from spending most of its time in context switches because of a VM that is waking/sleeping at high rate. ratelimit can be disabled by setting it to 0. Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- Changes in v3: * General fixes based on the review comments from v1 and v2. --- xen/common/sched_credit2.c | 111 ++--- 1 file changed, 85 insertions(+), 26 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index b92226c..d9f39dc 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -377,6 +377,7 @@ struct csched2_private { unsigned int load_precision_shift; unsigned int load_window_shift; +unsigned ratelimit_us; /* each cpupool can have its onw ratelimit */ }; /* @@ -2029,6 +2030,34 @@ csched2_dom_cntl( return rc; } +static int csched2_sys_cntl(const struct scheduler *ops, +struct xen_sysctl_scheduler_op *sc) +{ +int rc = -EINVAL; +xen_sysctl_credit_schedule_t *params = &sc->u.sched_credit; +struct csched2_private *prv = CSCHED2_PRIV(ops); +unsigned long flags; + +switch (sc->cmd ) +{ +case XEN_SYSCTL_SCHEDOP_putinfo: +if ( params->ratelimit_us && +( params->ratelimit_us > XEN_SYSCTL_SCHED_RATELIMIT_MAX || + params->ratelimit_us < XEN_SYSCTL_SCHED_RATELIMIT_MIN )) +return rc; +write_lock_irqsave(&prv->lock, flags); +prv->ratelimit_us = params->ratelimit_us; +write_unlock_irqrestore(&prv->lock, flags); +break; + +case XEN_SYSCTL_SCHEDOP_getinfo: +params->ratelimit_us = prv->ratelimit_us; +rc = 0; +break; +} +return rc; +} + static void * csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom) { @@ -2098,12 +2127,14 @@ csched2_dom_destroy(const struct scheduler *ops, struct domain *dom) /* How long should we let this vcpu run for? 
*/ static s_time_t -csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext) +csched2_runtime(const struct scheduler *ops, int cpu, +struct csched2_vcpu *snext, s_time_t now) { -s_time_t time; +s_time_t time, min_time; int rt_credit; /* Proposed runtime measured in credits */ struct csched2_runqueue_data *rqd = RQD(ops, cpu); struct list_head *runq = &rqd->runq; +struct csched2_private *prv = CSCHED2_PRIV(ops); /* * If we're idle, just stay so. Others (or external events) @@ -2116,9 +2147,22 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext * 1) Run until snext's credit will be 0 * 2) But if someone is waiting, run until snext's credit is equal * to his - * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER. + * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER or + * the ratelimit time. */ +/* Calculate mintime */ +min_time = CSCHED2_MIN_TIMER; +if ( prv->ratelimit_us ) +{ +s_time_t ratelimit_min = prv->ratelimit_us; +if ( snext->vcpu->is_running ) +ratelimit_min = snext->vcpu->runstate.state_entry_time + +MICROSECS(prv->ratelimit_us) - now; +if ( ratelimit_min > min_time ) +min_time = ratelimit_min; +} + /* 1) Basic time: Run until credit is 0. */ rt_credit = snext->credit; @@ -2135,32 +2179,32 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext } } -/* The next guy may actually have a higher credit, if we've tried to - * avoid migrating him from a different cpu. DTRT. */ +/* + * The next guy on the runqueue may actually have a higher credit, + * if we've tried to avoid migrating him from a different cpu. + * Setting time=0 will ensure the minimum timeslice is chosen. + * + * FIXME: See if we can eliminate this conversion if we know time + * will be outside (MIN,MAX). Probably requires pre-calculating + * credit values of MIN,MAX per vcpu, since each vcpu burns credit + * at a different rate.
+ */ +if (rt_credit > 0) +time = c2t(rqd, rt_credit, snext); +else +time = 0; + +/* 3) But never run longer than MAX_TIMER or less than MIN_TIMER or + * the rate_limit time. */ +if ( time < min_time) { -time = CSCHED2_MIN_TIMER; +time = min_time; SCHED_STAT_CRANK(runtime_min_timer); } -else +else if (time > CSCHED2_MAX_TIMER) { -/* FIXME: See if we can eliminate this conversion if we know time - * will be outside (MIN,MAX
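The tail of the v3 `csched2_runtime()` hunk above boils down to a simple clamp. The sketch below is a simplified, self-contained model of just that step, not the actual Xen code: the `c2t()` credit-to-time conversion is stubbed out as 1 credit == 1 ns, and the `MIN_TIMER`/`MAX_TIMER` values are illustrative stand-ins for `CSCHED2_MIN_TIMER`/`CSCHED2_MAX_TIMER`.

```c
#include <assert.h>

typedef long long s_time_t;

/* Illustrative stand-ins for CSCHED2_MIN_TIMER / CSCHED2_MAX_TIMER (ns). */
#define MIN_TIMER   500000LL   /* 0.5 ms */
#define MAX_TIMER 10000000LL   /* 10 ms  */

/*
 * Model of the clamping step: non-positive credit maps to time = 0 so
 * that the minimum timeslice wins; positive credit is converted to time
 * (stubbed here as 1 credit == 1 ns); the result is then clamped into
 * [min_time, MAX_TIMER].
 */
static s_time_t clamp_runtime(int rt_credit, s_time_t min_time)
{
    s_time_t time = (rt_credit > 0) ? (s_time_t)rt_credit : 0;

    if ( time < min_time )
        time = min_time;      /* never run shorter than min_time */
    else if ( time > MAX_TIMER )
        time = MAX_TIMER;     /* never run longer than MAX_TIMER */

    return time;
}
```

Note that when ratelimiting stretches `min_time` above the credit-derived time, the ratelimit wins, which is exactly the behaviour the patch is after.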
Re: [Xen-devel] [PATCH v2 1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue or
On 22/07/16 11:36, Dario Faggioli wrote: On Mon, 2016-07-18 at 13:22 +0100, Anshul Makkar wrote: Hey, Anshul. Thanks, and sorry for the delay in reviewing. This version is an improvement, but it looks to me that you've missed a few of the review comments to v1. Sorry about that. !! It introduces a minimum amount of latency "introduces context-switch rate-limiting" diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 8b95a47..68bcdb8 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -1601,6 +1602,34 @@ csched2_dom_cntl( +switch (sc->cmd ) +{ +case XEN_SYSCTL_SCHEDOP_putinfo: +if ( params->ratelimit_us && +( params->ratelimit_us < CSCHED2_MIN_TIMER || + params->ratelimit_us > I remember saying already (although, it may have been in pvt, not on this list) that I think we should just use XEN_SYSCTL_SCHED_RATELIMIT_MAX and XEN_SYSCTL_SCHED_RATELIMIT_MIN here. CSCHED2_MIN_TIMER and CSCHED2_MAX_TIMER are internal implementation details, and I don't like them exposed (although, indirectly) to the user. addressed. +return rc; +spin_lock_irqsave(&prv->lock, flags); + This is ok. However, the code base changed in the meanwhile (sorry! :-P), and this spin_lock_irqsave() needs to become a write_lock_irqsave(). done. Mmm... if you wanted to implement my suggestion from <1468400021.13039.33.ca...@citrix.com>, you're definitely missing something: s_time_t ratelimit_min = prv->ratelimit_us; if ( snext->vcpu->is_running ) ratelimit_min = snext->vcpu->runstate.state_entry_time + MICROSECS(prv->ratelimit_us) - now; yes, missed the if part for checking if the vcpu is currently running. In fact, you're initializing ratelimit_min and then immediately overriding that... I'm surprised the compiler didn't complain.
+if ( ratelimit_min > min_time ) +min_time = ratelimit_min; +} + @@ -1707,32 +1749,33 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext } } @@ -1746,7 +1789,7 @@ void __dump_execstate(void *unused); static struct csched2_vcpu * runq_candidate(struct csched2_runqueue_data *rqd, struct csched2_vcpu *scurr, - int cpu, s_time_t now) + int cpu, s_time_t now, struct csched2_private *prv) Reviewing v1, George said this: Since we have the cpu, I think we can get ops this way, without cluttering things up with the extra argument: struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); yes, missed that change too. Addressed in v3. @@ -1775,9 +1829,13 @@ runq_candidate(struct csched2_runqueue_data *rqd, } /* If the next one on the list has more credit than current - * (or idle, if current is not runnable), choose it. */ + * (or idle, if current is not runnable) and the current one has already + * executed for more than the ratelimit, choose it. + * Control reaching here means that the current vcpu has executed > + * ratelimit_us or ratelimit is off, so choose the next one. + */ if ( svc->credit > snext->credit ) -snext = svc; +snext = svc; Both me and George agreed that changing the comment like this is not helping much and should not be done. Though I find the extended comment useful; if you don't agree I will remove it in v3. Regards, Dario Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
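Dario's point about using the public `XEN_SYSCTL_SCHED_RATELIMIT_MIN`/`MAX` bounds instead of the scheduler-internal timer constants reduces to a small validation rule. A minimal sketch, with hypothetical stand-in values for the two public constants (the real ones live in Xen's sysctl interface):

```c
#include <assert.h>

/* Hypothetical stand-ins for XEN_SYSCTL_SCHED_RATELIMIT_MIN/MAX (us). */
#define RATELIMIT_MIN    100U
#define RATELIMIT_MAX 500000U

/*
 * The validation rule under discussion: 0 means "ratelimiting disabled"
 * and is always accepted; any non-zero value must lie within the public
 * MIN..MAX bounds. Returns 0 if acceptable, -22 (-EINVAL) otherwise.
 */
static int validate_ratelimit(unsigned int ratelimit_us)
{
    if ( ratelimit_us &&
         ( ratelimit_us < RATELIMIT_MIN || ratelimit_us > RATELIMIT_MAX ) )
        return -22;
    return 0;
}
```

Keeping the check in terms of the public sysctl constants means the toolstack and hypervisor agree on the valid range without exposing `CSCHED2_MIN_TIMER`/`CSCHED2_MAX_TIMER`.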
Re: [Xen-devel] [PATCH] XSM-docs: Flask operates on domain types and not on individual domain. Updated the documentation to reflect this.
On 19/07/16 14:30, Doug Goldstein wrote: On 7/19/16 4:05 AM, Anshul Makkar wrote: Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- * Resending the patch due to incomplete subject in the previous patch. docs/misc/xsm-flask.txt | 8 1 file changed, 4 insertions(+), 4 deletions(-) --- diff --git a/docs/misc/xsm-flask.txt b/docs/misc/xsm-flask.txt index 62f15dd..bf8bb6e 100644 Some examples of what FLASK can do: - - Prevent two domains from communicating via event channels or grants - - Control which domains can use device passthrough (and which devices) + - Prevent two domain types from communicating via event channels or grants + - Control which type of domains can use device passthrough (and which devices) I disagree with this snippet. This is an example of what you can do with FLASK. You can use flask to do those two actions. Adding the word "types" in there takes it from being a concrete example to being more ambiguous. "Prevent domains belonging to different types from communicating via event channels or grants". Does this sound better? I think that it's important to use the word "type" so that the user doesn't get the wrong impression that the policy is per domain, while in actuality it is per type. - Restrict or audit operations performed by privileged domains - Prevent a privileged domain from arbitrarily mapping pages from other domains @@ -160,10 +160,10 @@ the policy can be reloaded using "xl loadpolicy". The example policy included with Xen demonstrates most of the features of FLASK that can be used without dom0 disaggregation. The main types for domUs are: - - domU_t is a domain that can communicate with any other domU_t + - domU_t is a domain type that can communicate with any other domU_t types. "A domain labeled with domU_t can communicate with any other domain labeled with type domU_t." Rephrasing is fine.
- isolated_domU_t can only communicate with dom0 - prot_domU_t is a domain type whose creation can be disabled with a boolean - - nomigrate_t is a domain that must be created via the nomigrate_t_building + - nomigrate_t is a domain type that must be created via the nomigrate_t_building type, and whose memory cannot be read by dom0 once created "A domain labeled with nomigrate_t is a domain that" Rephrasing is fine. HVM domains with stubdomain device models also need a type for the stub domain. ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH] XSM-docs: Flask operates on domain types and not on individual domain. Updated the documentation to reflect this.
Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- * Resending the patch due to incomplete subject in the previous patch. docs/misc/xsm-flask.txt | 8 1 file changed, 4 insertions(+), 4 deletions(-) --- diff --git a/docs/misc/xsm-flask.txt b/docs/misc/xsm-flask.txt index 62f15dd..bf8bb6e 100644 --- a/docs/misc/xsm-flask.txt +++ b/docs/misc/xsm-flask.txt @@ -9,8 +9,8 @@ controls over Xen domains, allowing the policy writer to define what interactions between domains, devices, and the hypervisor are permitted. Some examples of what FLASK can do: - - Prevent two domains from communicating via event channels or grants - - Control which domains can use device passthrough (and which devices) + - Prevent two domain types from communicating via event channels or grants + - Control which type of domains can use device passthrough (and which devices) - Restrict or audit operations performed by privileged domains - Prevent a privileged domain from arbitrarily mapping pages from other domains @@ -160,10 +160,10 @@ the policy can be reloaded using "xl loadpolicy". The example policy included with Xen demonstrates most of the features of FLASK that can be used without dom0 disaggregation. The main types for domUs are: - - domU_t is a domain that can communicate with any other domU_t + - domU_t is a domain type that can communicate with any other domU_t types. - isolated_domU_t can only communicate with dom0 - prot_domU_t is a domain type whose creation can be disabled with a boolean - - nomigrate_t is a domain that must be created via the nomigrate_t_building + - nomigrate_t is a domain type that must be created via the nomigrate_t_building type, and whose memory cannot be read by dom0 once created HVM domains with stubdomain device models also need a type for the stub domain. -- 1.9.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue or being preempted by a higher-priority thread.
It introduces a minimum amount of latency to enable a VM to batch its work and it also ensures that the system is not spending most of its time in VMEXIT/VMENTRY because of a VM that is waking/sleeping at a high rate. ratelimit can be disabled by setting it to 0. Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- Changes in v2: * algo for time slice calculation based on ratelimit_us has changed. * initial value of prv->ratelimit_us. * other changes based on review comments. --- xen/common/sched_credit2.c | 124 + 1 file changed, 93 insertions(+), 31 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 8b95a47..68bcdb8 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -280,6 +280,7 @@ struct csched2_private { struct csched2_runqueue_data rqd[NR_CPUS]; unsigned int load_window_shift; +unsigned ratelimit_us; /* each cpupool can have its own ratelimit */ }; /* @@ -1601,6 +1602,34 @@ csched2_dom_cntl( return rc; } +static int csched2_sys_cntl(const struct scheduler *ops, +struct xen_sysctl_scheduler_op *sc) +{ +int rc = -EINVAL; +xen_sysctl_credit_schedule_t *params = &sc->u.sched_credit; +struct csched2_private *prv = CSCHED2_PRIV(ops); +unsigned long flags; + +switch ( sc->cmd ) +{ +case XEN_SYSCTL_SCHEDOP_putinfo: +if ( params->ratelimit_us && +( params->ratelimit_us < CSCHED2_MIN_TIMER || + params->ratelimit_us > MICROSECS(CSCHED2_MAX_TIMER) )) +return rc; +spin_lock_irqsave(&prv->lock, flags); +prv->ratelimit_us = params->ratelimit_us; +spin_unlock_irqrestore(&prv->lock, flags); +break; + +case XEN_SYSCTL_SCHEDOP_getinfo: +params->ratelimit_us = prv->ratelimit_us; +rc = 0; +break; +} +return rc; +} + static void * csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom) { @@ -1670,12 +1699,14 @@ csched2_dom_destroy(const struct scheduler *ops, struct domain *dom) /* How long should we let this vcpu run for?
*/ static s_time_t -csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext) +csched2_runtime(const struct scheduler *ops, int cpu, +struct csched2_vcpu *snext, s_time_t now) { -s_time_t time; +s_time_t time, min_time; int rt_credit; /* Proposed runtime measured in credits */ struct csched2_runqueue_data *rqd = RQD(ops, cpu); struct list_head *runq = &rqd->runq; +struct csched2_private *prv = CSCHED2_PRIV(ops); /* * If we're idle, just stay so. Others (or external events) @@ -1688,9 +1719,20 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext * 1) Run until snext's credit will be 0 * 2) But if someone is waiting, run until snext's credit is equal * to his - * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER. + * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER or + * run for ratelimit time. */ +/* Calculate mintime */ +min_time = CSCHED2_MIN_TIMER; +if ( prv->ratelimit_us ) { +s_time_t ratelimit_min = prv->ratelimit_us; +ratelimit_min = snext->vcpu->runstate.state_entry_time + +MICROSECS(prv->ratelimit_us) - now; +if ( ratelimit_min > min_time ) +min_time = ratelimit_min; +} + /* 1) Basic time: Run until credit is 0. */ rt_credit = snext->credit; @@ -1707,32 +1749,33 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext } } -/* The next guy may actually have a higher credit, if we've tried to - * avoid migrating him from a different cpu. DTRT. */ +/* + * The next guy on the runqueue may actually have a higher credit, + * if we've tried to avoid migrating him from a different cpu. + * Setting time=0 will ensure the minimum timeslice is chosen. + * FIXME: See if we can eliminate this conversion if we know time + * will be outside (MIN,MAX). Probably requires pre-calculating + * credit values of MIN,MAX per vcpu, since each vcpu burns credit + * at a different rate.
+ */ +if (rt_credit > 0) +time = c2t(rqd, rt_credit, snext); +else +time = 0; + +/* + * Never run longer than MAX_TIMER or less than MIN_TIMER or for + * rate_limit time. + */ +if ( time < min_time) { -time = CSCHED2_MIN_TIMER; -SCHED_STAT_CRANK(runtime_min_timer); + time = min_time; + SCHED_STAT_CRANK(runtime_min_timer); } -else +else if (time > CSCHED2_MAX_TIMER) { -/* FIXME: See if we can eliminate this conversi
Re: [Xen-devel] [PATCH] credi2-ratelimit: Implement rate limit for credit2 scheduler
On 13/07/16 09:53, Dario Faggioli wrote: On Tue, 2016-07-12 at 17:16 +0100, George Dunlap wrote: On Wed, Jul 6, 2016 at 6:33 PM, Makkar anshul.mak...@citrix.com wrote: --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c +#define MAX_TSLICE(t1, t2) \ + ({ typeof (t1) _t1 = (t1); \ + typeof (t1) _t2 = (t2); \ + _t1 > _t2 ? _t1 < 0 ? 0 : _t1 : _t2 < 0 ? 0 : _t2; }) Hiding the zero-checking behind this seems a bit strange to me. I think if the algorithm is properly laid out, we probably don't need this at all (see my proposed alternate below). I think it's more "overflow-avoidance" than "zero-checking", and as I said, that's probably better done by means of subtraction(s). In any case, I agree that, if we don't need it, that's even better. :-) Agreed. I am going to remove it. Its not needed with the modified algo. This is all really confusing. What about something like this (also attached)? The basic idea is to calculate min_time, then go through the normal algorithm, then clip it once at the end. Yes, this matches my idea and my comment, and I think the code provided by George makes sense. Only one thing: @@ -1675,9 +1711,19 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext + * or the ratelimit time. */ +/* Calculate mintime */ +min_time = CSCHED2_MIN_TIMER; +if ( prv->ratelimit_us ) { +s_time_t ratelimit_min = snext->vcpu->runstate.state_entry_time + +MICROSECS(prv->ratelimit_us) - now; Here snext can indeed be someone which was running already (e.g., we're choosing current again), in which case now - runstate.state_entry_time would indeed tell us how long it's actually been running, and the formula (coupled with the if below) is correct. But it also can be someone which is runnable (e.g., we're choosing someone from the runqueue and preempting current), in which case runstate.state_entry_time tells when it became runnable, and now - state_entry_time is how long it's been runnable, which is not what we want here.
I think, in such a case, we want ratelimit_min to just be equal to prv->ratelimit_us. So, maybe, something like this: /* Calculate mintime */ min_time = CSCHED2_MIN_TIMER; if ( prv->ratelimit_us ) { s_time_t ratelimit_min = prv->ratelimit_us; if ( snext->vcpu->is_running ) // XXX or is it better snext == curr_on_cpu(cpu) ratelimit_min = snext->vcpu->runstate.state_entry_time + MICROSECS(prv->ratelimit_us) - now; if ( ratelimit_min > min_time ) min_time = ratelimit_min; } Agreed. @@ -2353,6 +2431,8 @@ csched2_init(struct scheduler *ops) prv->runq_map[i] = -1; prv->rqd[i].id = -1; } +/* initialize ratelimit */ +prv->ratelimit_us = 1000 * CSCHED2_MIN_TIMER; Is there any reason this isn't using sched_ratelimit_us (the hypervisor command-line option, which the credit scheduler is using)? Yeah, I guess just using that is the best thing to start with. Agreed. Regards, Dario Will post the reworked patch. Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
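Dario's suggested `min_time` computation can be modeled as a small pure function. This is a sketch under simplifying assumptions, not the Xen code itself: `MIN_TIMER` stands in for `CSCHED2_MIN_TIMER`, all times are in nanoseconds, and the initial `ratelimit_min` is normalized via `MICROSECS()` for unit consistency.

```c
#include <assert.h>

typedef long long s_time_t;

#define MIN_TIMER 500000LL                        /* stand-in, 0.5 ms in ns */
#define MICROSECS(us) ((s_time_t)(us) * 1000LL)   /* us -> ns */

/*
 * If snext is already running, state_entry_time marks when it started
 * running, so the *remaining* ratelimit is
 * (state_entry_time + ratelimit) - now. If snext is merely runnable,
 * state_entry_time marks when it became runnable, so the full ratelimit
 * applies instead -- exactly the distinction the review points out.
 */
static s_time_t calc_min_time(unsigned int ratelimit_us, int is_running,
                              s_time_t state_entry_time, s_time_t now)
{
    s_time_t min_time = MIN_TIMER;

    if ( ratelimit_us )
    {
        s_time_t ratelimit_min = MICROSECS(ratelimit_us);
        if ( is_running )
            ratelimit_min = state_entry_time + MICROSECS(ratelimit_us) - now;
        if ( ratelimit_min > min_time )
            min_time = ratelimit_min;
    }
    return min_time;
}
```

With ratelimiting off the function degenerates to `MIN_TIMER`, and a vcpu that has nearly exhausted its ratelimit gets only the small remainder, never less than `MIN_TIMER`.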
[Xen-devel] [PATCH v2] XSM-Policy: allow source domain access to setpodtarget and getpodtarget for ballooning.
Access to setpodtarget and getpodtarget is required by dom0 to set the balloon targets for domU. The patch gives the source domain (dom0) access to set this target for domU and resolves the following permission denied error message during ballooning: avc: denied { setpodtarget } for domid=0 target=9 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=domain Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> Acked-by: Daniel De Graaf <dgde...@tycho.nsa.gov> --- Changed Since V1: * added getpodtarget. tools/flask/policy/modules/xen.if | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/flask/policy/modules/xen.if b/tools/flask/policy/modules/xen.if index 8c43c28..dbefa1e 100644 --- a/tools/flask/policy/modules/xen.if +++ b/tools/flask/policy/modules/xen.if @@ -83,7 +83,8 @@ define(`create_domain_build_label', ` define(`manage_domain', ` allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity getaddrsize pause unpause trigger shutdown destroy - setaffinity setdomainmaxmem getscheduler resume }; + setaffinity setdomainmaxmem getscheduler resume + setpodtarget getpodtarget }; allow $1 $2:domain2 set_vnumainfo; ') -- 1.9.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH] XSM-Policy: allow source domain access to setpodtarget for ballooning.
Access to setpodtarget is required by dom0 to set the balloon targets for domU. The patch gives the source domain (dom0) access to set this target for domU and resolves the following permission denied error message during ballooning: avc: denied { setpodtarget } for domid=0 target=9 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=domain Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- --- tools/flask/policy/modules/xen.if | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/flask/policy/modules/xen.if b/tools/flask/policy/modules/xen.if index 8c43c28..8ae3c2e 100644 --- a/tools/flask/policy/modules/xen.if +++ b/tools/flask/policy/modules/xen.if @@ -83,7 +83,8 @@ define(`create_domain_build_label', ` define(`manage_domain', ` allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity getaddrsize pause unpause trigger shutdown destroy - setaffinity setdomainmaxmem getscheduler resume }; + setaffinity setdomainmaxmem getscheduler resume + setpodtarget }; allow $1 $2:domain2 set_vnumainfo; ') -- 1.9.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] default XSM policy for PCI passthrough for unlabeled resources.
On 07/07/16 16:36, Daniel De Graaf wrote: On 07/06/2016 12:19 PM, anshul makkar wrote: On 06/07/16 16:59, Daniel De Graaf wrote: On 07/06/2016 11:34 AM, anshul makkar wrote: Hi, It allows the resource to be added and removed by the source domain to the target domain, but its use by the target domain is blocked. This rule only mandates the use of resource_type for resource types. If you are creating a new resource type, follow the example in nic_dev.te. Agreed, but inherently it means that "use" of any unlabeled resource, be it irq, ioport, iomem or nic_dev, is restricted. Restricted how? The fallback types have the resource_type attribute. Restricted if they are unlabeled. Neverallow rules are actually not present in the binary policy; they act as compile-time assertions in the policy build. Fine. The resource can be used only if it has been labeled using the flask-label-pci command, which needs to be rerun after every boot and after every policy reload. Try adding a module with the following rules, which should allow domU to use unlabeled devices: use_device(domU_t, irq_t) use_device(domU_t, ioport_t) use_device(domU_t, iomem_t) use_device(domU_t, device_t) Yes, it does work, but I have added these in delegate_device to restrict it to the case where there is delegation. This prevents using delegate_devices without allowing access to unlabeled devices. If you think this should be a macro, I would suggest making a new one named something like "delegate_unlabeled_devices". Agreed. That's a better approach. I believe this macro can make the default policy more flexible and useful for a more general audience, so it should be there in the policy. I can submit a patch for the same. Your thoughts? ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH] XSM/policy: Allow the source domain access to settime and setdomainhandle domctls while creating domain.
From: Anshul Makkar <anshul.mak...@citrix.com> This patch resolves the following permission denied scenarios while creating new domU : avc: denied { setdomainhandle } for domid=0 target=1 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=domain avc: denied { settime } for domid=0 target=1 scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t tclass=domain Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- tools/flask/policy/modules/xen.if | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/flask/policy/modules/xen.if b/tools/flask/policy/modules/xen.if index fd96303..8c43c28 100644 --- a/tools/flask/policy/modules/xen.if +++ b/tools/flask/policy/modules/xen.if @@ -48,7 +48,8 @@ define(`declare_build_label', ` define(`create_domain_common', ` allow $1 $2:domain { create max_vcpus setdomainmaxmem setaddrsize getdomaininfo hypercall setvcpucontext getscheduler - getvcpuinfo getaddrsize getaffinity setaffinity }; + getvcpuinfo getaddrsize getaffinity setaffinity + settime setdomainhandle }; allow $1 $2:domain2 { set_cpuid settsc setscheduler setclaim set_max_evtchn set_vnumainfo get_vnumainfo cacheflush psr_cmt_op psr_cat_op soft_reset }; -- 1.9.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH] credi2-ratelimit: Implement rate limit for credit2 scheduler
From: Anshul Makkar <anshul.mak...@citrix.com> Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue or being preempted by a higher-priority thread. It introduces a minimum amount of latency to enable a VM to batch its work and it also ensures that the system is not spending most of its time in VMEXIT/VMENTRY because of a VM that is waking/sleeping at a high rate. ratelimit can be disabled by setting it to 0. Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com> --- --- xen/common/sched_credit2.c | 115 ++--- 1 file changed, 98 insertions(+), 17 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 1933ff1..6718574 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -171,6 +171,11 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist); #define c2r(_ops, _cpu) (CSCHED2_PRIV(_ops)->runq_map[(_cpu)]) /* CPU to runqueue struct macro */ #define RQD(_ops, _cpu) (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)]) +/* Find the max of time slice */ +#define MAX_TSLICE(t1, t2) \ + ({ typeof (t1) _t1 = (t1); \ + typeof (t1) _t2 = (t2); \ + _t1 > _t2 ? _t1 < 0 ? 0 : _t1 : _t2 < 0 ? 0 : _t2; }) /* * Shifts for load average.
@@ -280,6 +285,7 @@ struct csched2_private { struct csched2_runqueue_data rqd[NR_CPUS]; unsigned int load_window_shift; +unsigned ratelimit_us; /* each cpupool can have its own ratelimit */ }; /* @@ -1588,6 +1594,34 @@ csched2_dom_cntl( return rc; } +static int csched2_sys_cntl(const struct scheduler *ops, +struct xen_sysctl_scheduler_op *sc) +{ +int rc = -EINVAL; +xen_sysctl_credit_schedule_t *params = &sc->u.sched_credit; +struct csched2_private *prv = CSCHED2_PRIV(ops); +unsigned long flags; + +switch ( sc->cmd ) +{ +case XEN_SYSCTL_SCHEDOP_putinfo: +if ( params->ratelimit_us && +( params->ratelimit_us < CSCHED2_MIN_TIMER || + params->ratelimit_us > MICROSECS(CSCHED2_MAX_TIMER) )) +return rc; +spin_lock_irqsave(&prv->lock, flags); +prv->ratelimit_us = params->ratelimit_us; +spin_unlock_irqrestore(&prv->lock, flags); +break; + +case XEN_SYSCTL_SCHEDOP_getinfo: +params->ratelimit_us = prv->ratelimit_us; +rc = 0; +break; +} +return rc; +} + static void * csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom) { @@ -1657,12 +1691,15 @@ csched2_dom_destroy(const struct scheduler *ops, struct domain *dom) /* How long should we let this vcpu run for? */ static s_time_t -csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext) +csched2_runtime(const struct scheduler *ops, int cpu, +struct csched2_vcpu *snext, s_time_t now) { -s_time_t time; +s_time_t time; int rt_credit; /* Proposed runtime measured in credits */ struct csched2_runqueue_data *rqd = RQD(ops, cpu); struct list_head *runq = &rqd->runq; +s_time_t runtime = 0; +struct csched2_private *prv = CSCHED2_PRIV(ops); /* * If we're idle, just stay so. Others (or external events) @@ -1680,6 +1717,14 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext /* 1) Basic time: Run until credit is 0.
*/ rt_credit = snext->credit; +if (snext->vcpu->is_running) +runtime = now - snext->vcpu->runstate.state_entry_time; +if ( runtime < 0 ) +{ +runtime = 0; +d2printk("%s: Time went backwards? now %"PRI_stime" state_entry_time %"PRI_stime"\n", + __func__, now, snext->runstate.state_entry_time); +} /* 2) If there's someone waiting whose credit is positive, * run until your credit ~= his */ @@ -1695,11 +1740,24 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext } /* The next guy may actually have a higher credit, if we've tried to - * avoid migrating him from a different cpu. DTRT. */ + * avoid migrating him from a different cpu. DTRT. + * Even if the next guy has higher credit and current vcpu has executed + * for less amount of time than rate limit, allow it to run for minimum + * amount of time. + */ if ( rt_credit <= 0 ) { -time = CSCHED2_MIN_TIMER; -SCHED_STAT_CRANK(runtime_min_timer); +if ( snext->vcpu->is_running && prv->ratelimit_us) + /* implies the current one has executed for time < ratelimit and thus +* it has been selected in runq_candidate to run next. +* No need to check f
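The statement-expression `MAX_TSLICE()` macro from this first version (later dropped in favour of George's explicit clamping) is equivalent to the following plain function, shown here as a sketch to make its negative-value handling explicit:

```c
#include <assert.h>

typedef long long s_time_t;

/*
 * Equivalent of MAX_TSLICE(t1, t2): pick the larger of the two values,
 * treating anything negative as 0 -- the overflow/underflow guard that
 * the reviewers noted is better expressed as explicit clamping.
 */
static s_time_t max_tslice(s_time_t t1, s_time_t t2)
{
    if ( t1 < 0 )
        t1 = 0;
    if ( t2 < 0 )
        t2 = 0;
    return t1 > t2 ? t1 : t2;
}
```

Writing it as a function also sidesteps the GCC-specific `({ ... })` statement-expression and the double evaluation concerns that motivate the `typeof` temporaries in the macro.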
Re: [Xen-devel] default XSM policy for PCI passthrough for unlabeled resources.
On 06/07/16 16:59, Daniel De Graaf wrote: On 07/06/2016 11:34 AM, anshul makkar wrote: Hi, It allows the resource to be added and removed by the source domain to the target domain, but its use by the target domain is blocked. This rule only mandates the use of resource_type for resource types. If you are creating a new resource type, follow the example in nic_dev.te. Agreed, but inherently it means that "use" of any unlabeled resource, be it irq, ioport, iomem or nic_dev, is restricted. The resource can be used only if it has been labeled using the flask-label-pci command, which needs to be rerun after every boot and after every policy reload. Yes; this gives the most control over what resources can be delegated. Policy reloads are supposed to be rare (on a production system) and you already need special boot scripts (or parameters) to set up the device for passthrough, so this can be added there. However, I agree this can be more work than a "default" FLASK policy should require. Try adding a module with the following rules, which should allow domU to use unlabeled devices: use_device(domU_t, irq_t) use_device(domU_t, ioport_t) use_device(domU_t, iomem_t) use_device(domU_t, device_t) Yes, it does work, but I have added these in delegate_device to restrict it to the case where there is delegation. If this works, that module could be added to the default policy. Given that what we ship is just a sample default policy for reference which is expected to be permissive in most of the scenarios so that it doesn't affect the basic functionalities, is this "neverallow" rule needed ? Thanks Anshul Makkar The neverallow rules are just there to ensure that the attributes are being used correctly. Anshul ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] default XSM policy for PCI passthrough for unlabeled resources.
Hi, The default XSM policy doesn't allow the use of unlabeled PCI resources that have been passed through to the target domain. xen.te: # Resources must be declared using resource_type neverallow * ~resource_type:resource use; It allows the resource to be added and removed by the source domain to the target domain, but its use by the target domain is blocked. The resource can be used only if it has been labeled using the flask-label-pci command, which needs to be rerun after every boot and after every policy reload. The above approach reduces flexibility and necessitates admin intervention to grant passthrough rights after the host has booted. Also, in general, if I want to allow all domUs to have PCI passthrough access to all PCI devices, I have no way apart from disabling the "neverallow" rule. Given that what we ship is just a sample default policy for reference, which is expected to be permissive in most scenarios so that it doesn't affect basic functionality, is this "neverallow" rule needed? Thanks Anshul Makkar
Re: [Xen-devel] [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement
On 18/06/16 00:13, Dario Faggioli wrote: In fact, right now, we recommend keeping runqueues arranged per-core, so that it is the inter-runqueue load balancing code that automatically spreads the work in an SMT friendly way. This means that any other runq arrangement one may want to use falls short of SMT scheduling optimizations. This commit implements SMT awareness --similar to the one we have in Credit1-- for any possible runq arrangement. This turned out to be pretty easy to do, as the logic can live entirely in runq_tickle() (although, in order to avoid for_each_cpu loops in that function, we use a new cpumask which indeed needs to be updated in other places). In addition to disentangling SMT awareness from load balancing, this also allows us to support the sched_smt_power_savings parameter in Credit2 as well. Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> Reviewed-by: Anshul Makkar <anshul.mak...@citrix.com> --- Cc: George Dunlap <george.dun...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> Cc: David Vrabel <david.vra...@citrix.com> --- xen/common/sched_credit2.c | 141 +++- 1 file changed, 126 insertions(+), 15 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 93943fa..a8b3a85 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -351,7 +351,8 @@ struct csched2_runqueue_data { unsigned int max_weight; cpumask_t idle,/* Currently idle */ -tickled; /* Another cpu in the queue is already targeted for this one */ +smt_idle, /* Fully idle cores (as in all the siblings are idle) */ +tickled; /* Have been asked to go through schedule */ int load; /* Instantaneous load: Length of queue + num non-idle threads */ s_time_t load_last_update; /* Last time average was updated */ s_time_t avgload; /* Decaying queue load */ @@ -412,6 +413,73 @@ struct csched2_dom { }; /* + * Hyperthreading (SMT) support. 
+ * + * We use a special per-runq mask (smt_idle) and update it according to the + * following logic: + * - when _all_ the SMT siblings in a core are idle, all their corresponding + *bits are set in the smt_idle mask; + * - when even _just_one_ of the SMT siblings in a core is not idle, all the + *bits corresponding to it and to all its siblings are clear in the + *smt_idle mask. + * + * Once we have such a mask, it is easy to implement a policy that, either: + * - uses fully idle cores first: it is enough to try to schedule the vcpus + *on pcpus from smt_idle mask first. This is what happens if + *sched_smt_power_savings was not set at boot (default), and it maximizes + *true parallelism, and hence performance; + * - uses already busy cores first: it is enough to try to schedule the vcpus + *on pcpus that are idle, but are not in smt_idle. This is what happens if + *sched_smt_power_savings is set at boot, and it allows as many cores as + *possible to stay in low power states, minimizing power consumption. + * + * This logic is entirely implemented in runq_tickle(), and that is enough. + * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a + * runq, _always_ happens by means of tickling: + * - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls + *runq_tickle(); + * - when a migration is initiated in schedule.c, we call csched2_cpu_pick(), + *csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake(). + *csched2_cpu_pick() looks for the least loaded runq and returns just any + *of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to + *the chosen runq, and it is again runq_tickle(), called by + *csched2_vcpu_wake(), that actually decides what pcpu to use within the + *chosen runq; + * - when a migration is initiated in sched_credit2.c, by calling migrate() + *directly, that again temporarily uses a random pcpu from the new runq, + *and then calls runq_tickle(), by itself. 
+ */ + +/* + * If all the siblings of cpu (including cpu itself) are in idlers, + * set all their bits in mask. + * + * In order to properly take into account tickling, idlers needs to be + * set equal to something like: + * + * rqd->idle & (~rqd->tickled) + * + * This is because cpus that have been tickled will very likely pick up some + * work as soon as they manage to schedule, and hence we should really consider + * them as busy. + */ +static inline +void smt_idle_mask_set(unsigned int cpu, cpumask_t *idlers, cpumask_t *mask) +{ +if ( cpumask_subset( per_cpu(cpu_sibling_mask, cpu), idlers) ) +cpumask_or(mask, mask, per_cpu(cpu_sibling_mask, cpu)); +} + +/* + * Clear the bits of all the siblings of cpu from mask. + */ +static inline +void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask) +{ +cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu)); +}
Re: [Xen-devel] [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter
On 18/06/16 00:11, Dario Faggioli wrote: In fact, they always operate on the svc->processor of the csched2_vcpu passed to them. No functional change intended. Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> Reviewed-by: Anshul Makkar <anshul.mak...@citrix.com> --- Cc: George Dunlap <george.dnu...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> Cc: David Vrabel <david.vra...@citrix.com> --- xen/common/sched_credit2.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 0246453..5881583 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -518,8 +518,9 @@ __runq_insert(struct list_head *runq, struct csched2_vcpu *svc) } static void -runq_insert(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *svc) +runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc) { +unsigned int cpu = svc->vcpu->processor; struct list_head * runq = &RQD(ops, cpu)->runq; int pos = 0; @@ -558,17 +559,17 @@ void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *svc, s_time_t /* Check to see if the item on the runqueue is higher priority than what's * currently running; if so, wake up the processor */ static /*inline*/ void -runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *new, s_time_t now) +runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now) { int i, ipid=-1; s_time_t lowest=(1<<30); +unsigned int cpu = new->vcpu->processor; struct csched2_runqueue_data *rqd = RQD(ops, cpu); cpumask_t mask; struct csched2_vcpu * cur; d2printk("rqt %pv curr %pv\n", new->vcpu, current); -BUG_ON(new->vcpu->processor != cpu); BUG_ON(new->rqd != rqd); /* Look at the cpu it's running on first */ @@ -1071,8 +1072,8 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc) update_load(ops, svc->rqd, svc, 1, now); /* Put the VCPU on the runq */ -runq_insert(ops, vc->processor, svc); 
-runq_tickle(ops, vc->processor, svc, now); +runq_insert(ops, svc); +runq_tickle(ops, svc, now); out: d2printk("w-\n"); @@ -1104,8 +1105,8 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc) { BUG_ON(__vcpu_on_runq(svc)); -runq_insert(ops, vc->processor, svc); -runq_tickle(ops, vc->processor, svc, now); +runq_insert(ops, svc); +runq_tickle(ops, svc, now); } else if ( !is_idle_vcpu(vc) ) update_load(ops, svc->rqd, svc, -1, now); @@ -1313,8 +1314,8 @@ static void migrate(const struct scheduler *ops, if ( on_runq ) { update_load(ops, svc->rqd, NULL, 1, now); -runq_insert(ops, svc->vcpu->processor, svc); -runq_tickle(ops, svc->vcpu->processor, svc, now); +runq_insert(ops, svc); +runq_tickle(ops, svc, now); SCHED_STAT_CRANK(migrate_on_runq); } else
Re: [Xen-devel] [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone.
On 18/06/16 00:11, Dario Faggioli wrote: In both Credit1 and Credit2, stop considering a pCPU idle, if the reason why the idle vCPU is being selected, is to do tasklet work. Not doing so means that the tickling and load balancing logic, seeing the pCPU as idle, considers it a candidate for picking up vCPUs. But the pCPU won't actually pick up or schedule any vCPU, which would then remain in the runqueue, which is bad, especially if there were other, truly idle pCPUs, that could execute it. The only drawback is that we can't assume that a pCPU is always marked as idle when being removed from an instance of the Credit2 scheduler (csched2_deinit_pdata). In fact, if we are in stop-machine (i.e., during suspend or shutdown), the pCPUs are running the stopmachine_tasklet and hence are actually marked as busy. On the other hand, when removing a pCPU from a Credit2 pool, it will indeed be idle. The only thing we can do, therefore, is to remove the BUG_ON() check. Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com> Reviewed-by: Anshul Makkar <anshul.mak...@citrix.com> --- Cc: George Dunlap <george.dun...@citrix.com> Cc: Anshul Makkar <anshul.mak...@citrix.com> Cc: David Vrabel <david.vra...@citrix.com> --- xen/common/sched_credit.c | 12 ++-- xen/common/sched_credit2.c | 14 ++ 2 files changed, 16 insertions(+), 10 deletions(-) diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index a38a63d..a6645a2 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -1819,24 +1819,24 @@ csched_schedule( else snext = csched_load_balance(prv, cpu, snext, &migrated); + out: /* * Update idlers mask if necessary. When we're idling, other CPUs * will tickle us when they get extra work. 
*/ -if ( snext->pri == CSCHED_PRI_IDLE ) +if ( tasklet_work_scheduled || snext->pri != CSCHED_PRI_IDLE ) { -if ( !cpumask_test_cpu(cpu, prv->idlers) ) -cpumask_set_cpu(cpu, prv->idlers); +if ( cpumask_test_cpu(cpu, prv->idlers) ) +cpumask_clear_cpu(cpu, prv->idlers); } -else if ( cpumask_test_cpu(cpu, prv->idlers) ) +else if ( !cpumask_test_cpu(cpu, prv->idlers) ) { -cpumask_clear_cpu(cpu, prv->idlers); +cpumask_set_cpu(cpu, prv->idlers); } if ( !is_idle_vcpu(snext->vcpu) ) snext->start_time += now; -out: /* * Return task to run next... */ diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c index 1933ff1..cf8455c 100644 --- a/xen/common/sched_credit2.c +++ b/xen/common/sched_credit2.c @@ -1910,8 +1910,16 @@ csched2_schedule( } else { -/* Update the idle mask if necessary */ -if ( !cpumask_test_cpu(cpu, &rqd->idle) ) +/* + * Update the idle mask if necessary. Note that, if we're scheduling + * idle in order to carry on some tasklet work, we want to play busy! + */ +if ( tasklet_work_scheduled ) +{ +if ( cpumask_test_cpu(cpu, &rqd->idle) ) +cpumask_clear_cpu(cpu, &rqd->idle); +} +else if ( !cpumask_test_cpu(cpu, &rqd->idle) ) cpumask_set_cpu(cpu, &rqd->idle); /* Make sure avgload gets updated periodically even * if there's no activity */ @@ -2291,8 +2299,6 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu) /* No need to save IRQs here, they're already disabled */ spin_lock(&rqd->lock); -BUG_ON(!cpumask_test_cpu(cpu, &rqd->idle)); - printk("Removing cpu %d from runqueue %d\n", cpu, rqi); cpumask_clear_cpu(cpu, &rqd->idle);
Re: [Xen-devel] [PATCH v4 03/34] xsm/xen_version: Add XSM for the xen_version hypercall
On 15/03/16 17:56, Konrad Rzeszutek Wilk wrote: All of XENVER_* now have an XSM check for their sub-ops. The subop for XENVER_commandline is now a privileged operation. To not break guests we still return a string - but it is just '\0'. The rest: XENVER_[version|extraversion|capabilities| parameters|get_features|page_size|guest_handle|changeset| compile_info] behave as before - allowed by default for all guests if using the XSM default policy or with the dummy one. The admin can choose to change the sub-ops to be denied as they see fit. Also we add a local variable block. Signed-off-by: Konrad Rzeszutek Wilk --- Cc: Daniel De Graaf Cc: Ian Jackson Cc: Stefano Stabellini Cc: Wei Liu v2: Do XSM check for all the XENVER_ ops. v3: Add empty data conditions. v4: Return for priv subops. v5: Move extraversion from priv to normal. Drop the XSM check for the non-priv subops. v6: Add +1 for strlen(xen_deny()) to include NULL. Move changeset, compile_info to non-priv subops. v7: Remove the \0 on xen_deny() v8: Add new XSM domain for xenver hypercall. Add all subops to it. v9: Remove the extra line, Add Ack from Daniel v10: Rename the XSM from xen_version_op to xsm_xen_version. Prefix the types with 'xen' to distinguish it from another hypercall performing similar operation. Removed Ack from Daniel as it was so large. Add local variable block. 
--- tools/flask/policy/policy/modules/xen/xen.te | 15 xen/common/kernel.c | 53 +--- xen/common/version.c | 15 xen/include/xen/version.h| 2 +- xen/include/xsm/dummy.h | 22 xen/include/xsm/xsm.h| 5 +++ xen/xsm/dummy.c | 1 + xen/xsm/flask/hooks.c| 43 ++ xen/xsm/flask/policy/access_vectors | 28 +++ xen/xsm/flask/policy/security_classes| 1 + 10 files changed, 172 insertions(+), 13 deletions(-) diff --git a/tools/flask/policy/policy/modules/xen/xen.te b/tools/flask/policy/policy/modules/xen/xen.te index d35ae22..7e7400d 100644 --- a/tools/flask/policy/policy/modules/xen/xen.te +++ b/tools/flask/policy/policy/modules/xen/xen.te @@ -73,6 +73,15 @@ allow dom0_t xen_t:xen2 { pmu_ctrl get_symbol }; + +# Allow dom0 to use all XENVER_ subops +# Note that dom0 is part of domain_type so this has duplicates. +allow dom0_t xen_t:version { +xen_version xen_extraversion xen_compile_info xen_capabilities +xen_changeset xen_platform_parameters xen_get_features xen_pagesize +xen_guest_handle xen_commandline +}; + allow dom0_t xen_t:mmu memorymap; # Allow dom0 to use these domctls on itself. For domctls acting on other @@ -137,6 +146,12 @@ if (guest_writeconsole) { # pmu_ctrl is for) allow domain_type xen_t:xen2 pmu_use; +# For normal guests all except XENVER_commandline +allow domain_type xen_t:version { +xen_version xen_extraversion xen_compile_info xen_capabilities +xen_changeset xen_platform_parameters xen_get_features xen_pagesize +xen_guest_handle +}; ### # # Domain creation diff --git a/xen/common/kernel.c b/xen/common/kernel.c index 0618da2..2699ac0 100644 --- a/xen/common/kernel.c +++ b/xen/common/kernel.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -223,12 +224,15 @@ void __init do_initcalls(void) /* * Simple hypercalls. 
*/ - DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) { +bool_t deny = !!xsm_xen_version(XSM_OTHER, cmd); + switch ( cmd ) { case XENVER_version: +if ( deny ) +return 0; return (xen_major_version() << 16) | xen_minor_version(); case XENVER_extraversion: @@ -236,7 +240,7 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) xen_extraversion_t extraversion; memset(extraversion, 0, sizeof(extraversion)); -safe_strcpy(extraversion, xen_extra_version()); +safe_strcpy(extraversion, deny ? xen_deny() : xen_extra_version()); if ( copy_to_guest(arg, extraversion, ARRAY_SIZE(extraversion)) ) return -EFAULT; return 0; @@ -247,10 +251,10 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) xen_compile_info_t info; memset(&info, 0, sizeof(info)); -safe_strcpy(info.compiler, xen_compiler()); -safe_strcpy(info.compile_by, xen_compile_by()); -safe_strcpy(info.compile_domain, xen_compile_domain()); -safe_strcpy(info.compile_date, xen_compile_date()); +safe_strcpy(info.compiler, deny ? xen_deny() :
Re: [Xen-devel] [PATCH] flask: change default state to enforcing
-Original Message- From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of Konrad Rzeszutek Wilk Sent: 10 March 2016 19:12 To: Daniel De Graaf; Ian Jackson ; jbeul...@suse.com Cc: xen-de...@lists.xenproject.org; car...@cardoe.com; Andrew Cooper Subject: Re: [Xen-devel] [PATCH] flask: change default state to enforcing On Thu, Mar 10, 2016 at 01:30:29PM -0500, Daniel De Graaf wrote: I've added Ian and Jan on the email as scripts/get_maintainer.pl spits out their names (Oddly not yours?) > The previous default of "permissive" is meant for developing or > debugging a disaggregated system. However, this default makes it too > easy to accidentally boot a machine in this state, which does not > place any restrictions on guests. This is not suitable for normal > systems because any guest can perform any operation (including > operations like rebooting the machine, kexec, and reading or writing > another domain's memory). > > This change will cause the boot to fail if you do not specify an XSM > policy during boot; if you need to load a policy from dom0, use the > "flask=late" boot parameter. > > Originally by Konrad Rzeszutek Wilk ; modified > to also change the default value of flask_enforcing so that the policy > is not still in permissive mode. This also removes the (no longer > documented) command line argument directly changing that variable > since it has been superseded by the flask= parameter. > Reviwed-by: Konrad Rzeszutek Wilk .. however: > Signed-off-by: Daniel De Graaf > --- > > docs/misc/xen-command-line.markdown | 2 +- > docs/misc/xsm-flask.txt | 12 ++-- > xen/xsm/flask/flask_op.c| 8 +--- > 3 files changed, 12 insertions(+), 10 deletions(-) > > diff --git a/docs/misc/xen-command-line.markdown > b/docs/misc/xen-command-line.markdown > index ca77e3b..9e77f8a 100644 > --- a/docs/misc/xen-command-line.markdown > +++ b/docs/misc/xen-command-line.markdown > @@ -662,7 +662,7 @@ to use the default. 
> ### flask > > `= permissive | enforcing | late | disabled` > > -> Default: `permissive` > +> Default: `enforcing` > > Specify how the FLASK security server should be configured. This > option is only available if the hypervisor was compiled with XSM > support (which can be enabled diff --git a/docs/misc/xsm-flask.txt > b/docs/misc/xsm-flask.txt index fb2fe9f..00a2b13 100644 > --- a/docs/misc/xsm-flask.txt > +++ b/docs/misc/xsm-flask.txt > @@ -283,12 +283,12 @@ for passthrough, run: > > This command must be rerun on each boot or after any policy reload. > > -The example policy was only tested with simple domain creation and > may be -missing rules allowing accesses by dom0 or domU when a number > of hypervisor -features are used. When first loading or writing a > policy, you should run FLASK -in permissive mode (the default) and > check the Xen logs (xl dmesg) for AVC -denials before using it in > enforcing mode (flask_enforcing=1 on the command -line, or xl setenforce). > +When first loading or writing a policy, you should run FLASK in > +permissive mode (flask=permissive on the command line) and check the > +Xen logs (xl dmesg) for AVC denials before using it in enforcing mode > +(the default value of the boot parameter, which can also be changed > +using xl setenforce). When using the default types for domains > +(domU_t), the example policy shipped with Xen should allow the same > operations on or between domains as when not using FLASK. 
> > > MLS/MCS policy > diff --git a/xen/xsm/flask/flask_op.c b/xen/xsm/flask/flask_op.c index > f4f5dd1..cdb462c 100644 > --- a/xen/xsm/flask/flask_op.c > +++ b/xen/xsm/flask/flask_op.c > @@ -25,12 +25,11 @@ > #define _copy_to_guest copy_to_guest > #define _copy_from_guest copy_from_guest > > -enum flask_bootparam_t __read_mostly flask_bootparam = > FLASK_BOOTPARAM_PERMISSIVE; > +enum flask_bootparam_t __read_mostly flask_bootparam = > +FLASK_BOOTPARAM_ENFORCING; > static void parse_flask_param(char *s); custom_param("flask", > parse_flask_param); > > -bool_t __read_mostly flask_enforcing = 0; > -boolean_param("flask_enforcing", flask_enforcing); > +bool_t __read_mostly flask_enforcing = 1; Since you set that to the default value should the parse_flask_param 'flask_enforcing = 1' for the 'enforcing' and 'late' be removed? (If you agree, the committer could do it). > > #define MAX_POLICY_SIZE 0x400 > > @@ -76,7 +75,10 @@ static void __init parse_flask_param(char *s) > else if ( !strcmp(s, "disabled") ) > flask_bootparam = FLASK_BOOTPARAM_DISABLED; > else if ( !strcmp(s, "permissive") ) > +{ > +flask_enforcing = 0; > flask_bootparam = FLASK_BOOTPARAM_PERMISSIVE; > +} > else > flask_bootparam = FLASK_BOOTPARAM_INVALID; } >
Re: [Xen-devel] [PATCH v2 2/3] xen: add hypercall option to temporarily pin a vcpu
Hi, -Original Message- From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of George Dunlap Sent: 01 March 2016 15:53 To: Juergen Gross <jgr...@suse.com>; xen-devel@lists.xen.org Cc: Wei Liu <wei.l...@citrix.com>; Stefano Stabellini <stefano.stabell...@citrix.com>; George Dunlap <george.dun...@citrix.com>; Andrew Cooper <andrew.coop...@citrix.com>; Dario Faggioli <dario.faggi...@citrix.com>; Ian Jackson <ian.jack...@citrix.com>; David Vrabel <david.vra...@citrix.com>; jbeul...@suse.com Subject: Re: [Xen-devel] [PATCH v2 2/3] xen: add hypercall option to temporarily pin a vcpu On 01/03/16 09:02, Juergen Gross wrote: > Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be > called on physical cpu 0 only. Linux drivers like dcdbas or i8k try to > achieve this by pinning the running thread to cpu 0, but in Dom0 this > is not enough: the vcpu must be pinned to physical cpu 0 via Xen, too. > > Add a stable hypercall option SCHEDOP_pin_temp to the sched_op > hypercall to achieve this. It is taking a physical cpu number as > parameter. If pinning is possible (the calling domain has the > privilege to make the call and the cpu is available in the domain's > cpupool) the calling vcpu is pinned to the specified cpu. The old cpu > affinity is saved. To undo the temporary pinning a cpu -1 is > specified. This will restore the original cpu affinity for the vcpu. 
> > Signed-off-by: Juergen Gross <jgr...@suse.com> > --- > V2: - limit operation to hardware domain as suggested by Jan Beulich > - some style issues corrected as requested by Jan Beulich > - use fixed width types in interface as requested by Jan Beulich > - add compat layer checking as requested by Jan Beulich > --- > xen/common/compat/schedule.c | 4 ++ > xen/common/schedule.c| 92 > +--- > xen/include/public/sched.h | 17 > xen/include/xlat.lst | 1 + > 4 files changed, 109 insertions(+), 5 deletions(-) > > diff --git a/xen/common/compat/schedule.c > b/xen/common/compat/schedule.c index 812c550..73b0f01 100644 > --- a/xen/common/compat/schedule.c > +++ b/xen/common/compat/schedule.c > @@ -10,6 +10,10 @@ > > #define do_sched_op compat_sched_op > > +#define xen_sched_pin_temp sched_pin_temp CHECK_sched_pin_temp; > +#undef xen_sched_pin_temp > + > #define xen_sched_shutdown sched_shutdown CHECK_sched_shutdown; > #undef xen_sched_shutdown diff --git a/xen/common/schedule.c > b/xen/common/schedule.c index b0d4b18..653f852 100644 > --- a/xen/common/schedule.c > +++ b/xen/common/schedule.c > @@ -271,6 +271,12 @@ int sched_move_domain(struct domain *d, struct cpupool > *c) > struct scheduler *old_ops; > void *old_domdata; > > +for_each_vcpu ( d, v ) > +{ > +if ( v->affinity_broken ) > +return -EBUSY; > +} > + > domdata = SCHED_OP(c->sched, alloc_domdata, d); > if ( domdata == NULL ) > return -ENOMEM; > @@ -669,6 +675,14 @@ int cpu_disable_scheduler(unsigned int cpu) > if ( cpumask_empty(_affinity) && > cpumask_test_cpu(cpu, v->cpu_hard_affinity) ) > { > +if ( v->affinity_broken ) > +{ > +/* The vcpu is temporarily pinned, can't move it. */ > +vcpu_schedule_unlock_irqrestore(lock, flags, v); > +ret = -EBUSY; > +break; > +} Does this mean that if the user closes the laptop lid while one of these drivers has vcpu0 pinned, that Xen will crash (see xen/arch/x86/smpboot.c:__cpu_disable())? Or is it the OS's job to make sure that all temporary pins are removed before suspending? 
Also -- have you actually tested the "cpupool move while pinned" functionality to make sure it actually works? There's a weird bit in cpupool_unassign_cpu_helper() where after calling cpu_disable_scheduler(cpu), it unconditionally sets the cpu bit in the cpupool_free_cpus mask, even if it returns an error. That can't be right, even for the existing -EAGAIN case, can it? I see that you have a loop to retry this call several times in the next patch; but what if it fails every time -- what state is the system in? And, in general, what happens if the device driver gets mixed up and forgets to unpin the vcpu? Is the only recourse to reboot your host (or deal with the fact that you can't reconfigure your cpupools)? -George Sorry, lost the original thread so replying at the top of mail chain. +static XSM_INLINE int xsm_schedop_pin_temp(XSM_DEFAULT_VOID) +{ + XSM_ASSERT_ACTI
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
Based on the discussion below, can I assume there is an agreement for using processor model for filtering or chipset ID will be the preferred candidate. Thanks Anshul Makkar -Original Message- From: Tian, Kevin [mailto:kevin.t...@intel.com] Sent: 26 November 2015 07:17 To: Malcolm Crossley <malcolm.cross...@citrix.com>; Jan Beulich <jbeul...@suse.com>; Andrew Cooper <andrew.coop...@citrix.com>; Anshul Makkar <anshul.mak...@citrix.com> Cc: Zhang, Yang Z <yang.z.zh...@intel.com>; xen-devel@lists.xen.org Subject: RE: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors. > From: Malcolm Crossley [mailto:malcolm.cross...@citrix.com] > Sent: Wednesday, November 25, 2015 11:59 PM > > On 25/11/15 15:38, Jan Beulich wrote: > >>>> On 25.11.15 at 16:13, <andrew.coop...@citrix.com> wrote: > >> On 25/11/15 10:49, Jan Beulich wrote: > >>>>>> On 25.11.15 at 11:28, <andrew.coop...@citrix.com> wrote: > >>>> On 24/11/15 17:41, Jan Beulich wrote: > >>>>>>>> On 24.11.15 at 18:17, wrote: > >>>>>> --- a/xen/drivers/passthrough/vtd/quirks.c > >>>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c > >>>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void) > >>>>>> /* Tylersburg interrupt remap quirk */ > >>>>>> if ( iommu_intremap ) > >>>>>> tylersburg_intremap_quirk(); > >>>>>> + > >>>>>> +/* > >>>>>> + * Disable shared EPT ("sharept") on Sandybridge and older > >>>>>> processors > >>>>>> + * by default. > >>>>>> + * SandyBridge has no huge page support for IOTLB which > >>>>>> + leads to > >> fallback > >>>>>> + * on 4k pages and leads to performance degradation. > >>>>>> + * > >>>>>> + * Shared EPT ("sharept") will be disabled only if user has not > >>>>>> + * provided explicit choice on the command line thus > >>>>>> + iommu_hap_pt_share > >> is > >>>>>> + * at its initialized value of -1. 
> >>>>>> + */ > >>>>>> +if ( (boot_cpu_data.x86 == 0x06 && > >>>>>> + (boot_cpu_data.x86_model <= 0x2F > || > >>>>>> + boot_cpu_data.x86_model == 0x36)) && > >>>>>> + (iommu_hap_pt_share == > -1) ) > >>>>>> +iommu_hap_pt_share = 0; > >>>>> If we really want to do this, then I think we should key this on > >>>>> EPT but not VT-d having 2M support, instead of on CPU models. > >>>> This check is already performed by vtd_ept_page_compatible() > >>> Yeah, I realized there would be such a check on the way home. > >>> > >>>> The problem is that SandyBridge IOMMUs advertise 2M support and > >>>> do function with it, but cannot cache 2MB translations in the IOTLBs. > >>>> > >>>> As a result, attempting to use 2M translations causes > >>>> substantially worse performance than 4K translations. > >>> So commit message and comment should make this more explicit, to > >>> avoid the impression "IOTLB" isn't just the relatively common > >>> mis-naming of "IOMMU". > >>> > >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb? > >>> > >>> Further the model based check is relatively broad, and includes > >>> Atoms (0x36 actually is one), which can't be considered > >>> "Sandybridge or older" imo. > >>> > >>> And finally I'm not fully convinced using CPU model info to deduce > >>> chipset behavior is entirely correct (albeit perhaps in practice > >>> it'll be fine except maybe when running Xen itself virtualized). > >> > >> What else would you suggest? I can't think of any better > >> identifying information. > > > > Chipset IDs / revisions? > > In this case the IOMMU is integrated into the Sandybridge-EP processor itself. > Unfortunately there's no register to query the IOTLB configuration of > the IOMMU and so we're stuck identifying the via the processor model number > itself. > > Malcolm > I'm OK to use processor model here, though ideally Jan is right. :-) Thanks Kevin ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
Snabbswitch (virtualized switch) also encountered similar problem : https://groups.google.com/forum/#!topic/snabb-devel/xX0yFzeXylI Thanks Anshul Makkar -Original Message- From: Andrew Cooper [mailto:andrew.coop...@citrix.com] Sent: 01 December 2015 10:34 To: Konrad Rzeszutek Wilk <konrad.w...@oracle.com> Cc: Jan Beulich <jbeul...@suse.com>; Kevin Tian <kevin.t...@intel.com>; yang.z.zh...@intel.com; Malcolm Crossley <malcolm.cross...@citrix.com>; Anshul Makkar <anshul.mak...@citrix.com>; xen-devel@lists.xen.org Subject: Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors. On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote: > On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote: >> On 26/11/15 13:48, Malcolm Crossley wrote: >>> On 26/11/15 13:46, Jan Beulich wrote: >>>>>>> On 25.11.15 at 11:28, <andrew.coop...@citrix.com> wrote: >>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do >>>>> function with it, but cannot cache 2MB translations in the IOTLBs. >>>>> >>>>> As a result, attempting to use 2M translations causes >>>>> substantially worse performance than 4K translations. >>>> Btw - how does this get explained? At a first glance, even if 2Mb >>>> translations don't get entered into the TLB, it should still be one >>>> less page table level to walk for the IOMMU, and should hence >>>> nevertheless be a benefit. Yet you even say _substantially_ worse >>>> performance results. >>> There is a IOTLB for the 4K translation so if you only use 4K >>> translations then you get to take advantage of the IOTLB. >>> >>> If you use the 2Mb translation then a page table walk has to be >>> performed every time there's a DMA access to that region of the BFN >>> address space. >> Also remember that a high level dma access (from the point of view of >> a >> driver) will be fragmented at the PCIe max packet size, which is >> typically 256 bytes. 
>> >> So by not caching the 2Mb translation, a dma access of 4k may undergo >> 16 pagetable walks, one for each PCIe packet. >> >> We observed that using 2Mb mappings results in a 40% overhead, >> compared to using 4k mappings, from the point of view of a sample network >> workload. > How did you observe this? I am mighty curious what kind of performance > tools you used to find this as I would love to figure out if some of > the issues we have seen are related to this? The 40% difference is just in terms of network throughput of a VF, given a workload which can normally saturate line rate on the card. ~Andrew
[Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
From: Anshul Makkar <anshul.mak...@citrix.com>

Sandybridge and earlier processors have no huge page support in the
IOTLB, which leads to a fallback to 4k pages and causes performance
issues. Shared EPT will be disabled only if the user has not provided
an explicit choice on the command line.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
v2:
 * Removed the use of an extra variable to control shared EPT and made
   the existing variable tristate.
 * Narrowed down the check to Sandybridge and older processors,
   including Atom processors.

 docs/misc/xen-command-line.markdown  |  2 +-
 xen/drivers/passthrough/iommu.c      |  2 +-
 xen/drivers/passthrough/vtd/quirks.c | 14 ++++++++++++++
 xen/include/xen/iommu.h              |  2 +-
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index a2e427c..6b69ba2 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -896,7 +896,7 @@ debug hypervisor only).
 
 > `sharept`
 
-> Default: `true`
+> Default: `true` if newer than Sandybridge, `false` for Sandybridge or earlier.
 
 >> Control whether CPU and IOMMU page tables should be shared.
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index d5137733..9367987 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -51,7 +51,7 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
-bool_t __read_mostly iommu_hap_pt_share = 1;
+s8 __read_mostly iommu_hap_pt_share = -1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;

diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index 143..7d63c8d 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
     /* Tylersburg interrupt remap quirk */
     if ( iommu_intremap )
         tylersburg_intremap_quirk();
+
+    /*
+     * Disable shared EPT ("sharept") on Sandybridge and older processors
+     * by default.
+     * Sandybridge has no huge page support for the IOTLB, which leads to
+     * a fallback to 4k pages and to performance degradation.
+     *
+     * Shared EPT ("sharept") will be disabled only if the user has not
+     * provided an explicit choice on the command line, i.e.
+     * iommu_hap_pt_share is still at its initialized value of -1.
+     */
+    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
+          boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
+        iommu_hap_pt_share = 0;
 }
 
 /*
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8f3a20e..d52d06f 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -31,7 +31,7 @@ extern bool_t iommu_enable, iommu_enabled;
 extern bool_t force_iommu, iommu_verbose;
 extern bool_t iommu_workaround_bios_bug, iommu_igfx, iommu_passthrough;
 extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
-extern bool_t iommu_hap_pt_share;
+extern s8 iommu_hap_pt_share;
 extern bool_t iommu_debug;
 extern bool_t amd_iommu_perdev_intremap;
-- 
1.9.1