Re: [Xen-devel] [PATCH 2/3] credit2: libxl related changes to add support for runqueue per cpupool.

2017-11-16 Thread Anshul Makkar

[Trimming the Cc-list a bit]


On 9/14/17 7:37 AM, Juergen Gross wrote:

On 12/09/17 02:45, anshulmakkar wrote:

Introduces scheduler specific parameter at libxl level which are
passed on to libxc. eg runqueue for credit2

Signed-off-by: Anshul Makkar <anshulmak...@gmail.com>

  int libxl_cpupool_destroy(libxl_ctx *ctx, uint32_t poolid);
  int libxl_cpupool_rename(libxl_ctx *ctx, const char *name, uint32_t poolid);
  int libxl_cpupool_cpuadd(libxl_ctx *ctx, uint32_t poolid, int cpu);
diff --git a/tools/libxl/libxl_cpupool.c b/tools/libxl/libxl_cpupool.c
index 85b0688..e3ce7b3 100644
--- a/tools/libxl/libxl_cpupool.c
+++ b/tools/libxl/libxl_cpupool.c
@@ -130,7 +130,7 @@ int libxl_get_freecpus(libxl_ctx *ctx, libxl_bitmap *cpumap)
  int libxl_cpupool_create(libxl_ctx *ctx, const char *name,
   libxl_scheduler sched,
   libxl_bitmap cpumap, libxl_uuid *uuid,
- uint32_t *poolid)
+ uint32_t *poolid, const libxl_scheduler_params 
*sched_params)
  {
  GC_INIT(ctx);
  int rc;
@@ -138,6 +138,7 @@ int libxl_cpupool_create(libxl_ctx *ctx, const char *name,
  xs_transaction_t t;
  char *uuid_string;
  uint32_t xcpoolid;
+xc_schedparam_t xc_sched_param;
  
  /* Accept '0' as 'any poolid' for backwards compatibility */

  if ( *poolid == LIBXL_CPUPOOL_POOLID_ANY
@@ -151,8 +152,18 @@ int libxl_cpupool_create(libxl_ctx *ctx, const char *name,
  GC_FREE;
  return ERROR_NOMEM;
  }
+if (sched_params)
+{
+xc_sched_param.u.sched_credit2.ratelimit_us =
+
sched_params->u.credit2.ratelimit_us;
+xc_sched_param.u.sched_credit2.runq = sched_params->u.credit2.runqueue;
+xc_sched_param.u.sched_credit.tslice_ms = 
sched_params->u.credit.tslice_ms;
+xc_sched_param.u.sched_credit.ratelimit_us = 
sched_params->u.credit.ratelimit_us;

Don't you need some input parameter validation here?

Agree. Will perform validation.

+}
+else
+xc_sched_param.u.sched_credit2.runq = LIBXL_CREDIT2_RUNQUEUE_DEFAULT;

So you are passing the LIBXL defines down to the hypervisor expecting
they match. I think this is a major layering violation.

I need to pass the DEFAULT runq arrangement if the user has not selected
any option and I want to do it near to the top level (libxc) so that 
consistency

can be maintained at the lower scheduler layer.
Please can you suggest alternative that will maintain layering consistency.




Juergen

anshul


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 2/4] xen: sched: optimize exclusive pinning case (Credit1 & 2)

2017-09-20 Thread Anshul Makkar



On 9/15/17 6:35 PM, Dario Faggioli wrote:
  
  static unsigned int __read_mostly opt_migrate_resist = 500;

  integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
@@ -1453,6 +1459,26 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  (unsigned char *));
  }
  
+/*

+ * Exclusive pinning is when a vcpu has hard-affinity with only one
+ * cpu, and there is no other vcpu that has hard-affinity with that
+ * same cpu. This is infrequent, but if it happens, is for achieving
+ * the most possible determinism, and least possible overhead for
+ * the vcpus in question.
+ *
+ * Try to identify the vast majority of these situations, and deal
+ * with them quickly.

Sorry, if I have missed to review the earlier series on the same subject.
But, I am not completely satisfied with the condition that pinning to a
pcpu is only possible if and only if no other vcpu has hard affinity of 
that pcpu.

I think we can do away with this condition and give pinning a priority.
Your thoughts please..

+ */
+if ( unlikely((new->flags & CSFLAG_pinned) &&
+  cpumask_test_cpu(cpu, >idle) &&
+  !cpumask_test_cpu(cpu, >tickled)) )
+{
+ASSERT(cpumask_cycle(cpu, new->vcpu->cpu_hard_affinity) == cpu);
+SCHED_STAT_CRANK(tickled_idle_cpu_excl);
+ipid = cpu;
+goto tickle;
+}
+
  for_each_affinity_balance_step( bs )
  {
  /* Just skip first step, if we don't have a soft affinity */
@@ -2826,6 +2852,19 @@ csched2_dom_cntl(
  return rc;
  }
  
+static void

+csched2_aff_cntl(const struct scheduler *ops, struct vcpu *v,
+ const cpumask_t *hard, const cpumask_t *soft)
+{
+struct csched2_vcpu *svc = csched2_vcpu(v);
+
+/* Are we becoming exclusively pinned? */
+if ( cpumask_weight(hard) == 1 )
+__set_bit(__CSFLAG_pinned, >flags);
+else
+__clear_bit(__CSFLAG_pinned, >flags);
+}
+
  static int csched2_sys_cntl(const struct scheduler *ops,
  struct xen_sysctl_scheduler_op *sc)

Looks fine.

Anshul


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/4] xen: sched: optimize exclusive pinning and soft-affinity checking

2017-09-18 Thread Anshul Makkar



On 9/15/17 6:35 PM, Dario Faggioli wrote:

Hello,

Dario Faggioli (4):
   xen: sched: introduce 'adjust_affinity' hook.
   xen: sched: optimize exclusive pinning case (Credit1 & 2)
   xen: sched: improve checking soft-affinity
   xen: sched: simplify (and speedup) checking soft-affinity

  xen/arch/x86/dom0_build.c|4 +
  xen/common/sched_credit.c|  114 +++---
  xen/common/sched_credit2.c   |   50 --
  xen/common/sched_null.c  |8 +--
  xen/common/schedule.c|   67 +++--
  xen/include/xen/perfc_defn.h |1
  xen/include/xen/sched-if.h   |   16 +++---
  xen/include/xen/sched.h  |5 ++
  8 files changed, 188 insertions(+), 77 deletions(-)
--
Please can you share the linkĀ  for the branch on which these changes 
have been done .


Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2 1/4] xen: credit2: implement utilization cap

2017-08-24 Thread Anshul Makkar



On 8/18/17 4:50 PM, Dario Faggioli wrote:
  
@@ -474,6 +586,12 @@ static inline struct csched2_runqueue_data *c2rqd(const struct scheduler *ops,

  return _priv(ops)->rqd[c2r(cpu)];
  }
  
+/* Does the domain of this vCPU have a cap? */

+static inline bool has_cap(const struct csched2_vcpu *svc)
+{
+return svc->budget != STIME_MAX;
+}
+
  /*
   * Hyperthreading (SMT) support.
   *
@@ -1515,7 +1633,16 @@ static void reset_credit(const struct scheduler *ops, 
int cpu, s_time_t now,
   * that the credit it has spent so far get accounted.
   */
  if ( svc->vcpu == curr_on_cpu(svc_cpu) )
+{
  burn_credits(rqd, svc, now);
+/*
+ * And, similarly, in case it has run out of budget, as a
+ * consequence of this round of accounting, we also must inform
+ * its pCPU that it's time to park it, and pick up someone else.
+ */
+if ( unlikely(svc->budget <= 0) )
+tickle_cpu(svc_cpu, rqd);
This is for accounting of credit. Why it willl impact the budget. Do you 
intend to refer that

budget of current vcpu expired while doing calculation for credit ??

+}
  
  start_credit = svc->credit;
  
@@ -1571,27 +1698,35 @@ void burn_credits(struct csched2_runqueue_data *rqd,
  
  delta = now - svc->start_time;
  
-if ( likely(delta > 0) )

-{
-SCHED_STAT_CRANK(burn_credits_t2c);
-t2c_update(rqd, delta, svc);
-svc->start_time = now;
-}
-else if ( delta < 0 )
+if ( unlikely(delta <= 0) )
  {

+static void replenish_domain_budget(void* data)
+{
+struct csched2_dom *sdom = data;
+unsigned long flags;
+s_time_t now;
+LIST_HEAD(parked);
+
+spin_lock_irqsave(>budget_lock, flags);
+
+now = NOW();
+
+/*
+ * Let's do the replenishment. Note, though, that a domain may overrun,
+ * which means the budget would have gone below 0 (reasons may be system
+ * overbooking, accounting issues, etc.). It also may happen that we are
+ * handling the replenishment (much) later than we should (reasons may
+ * again be overbooking, or issues with timers).
+ *
+ * Even in cases of overrun or delay, however, we expect that in 99% of
+ * cases, doing just one replenishment will be good enough for being able
+ * to unpark the vCPUs that are waiting for some budget.
+ */
+do_replenish(sdom);
+
+/*
+ * And now, the special cases:
+ * 1) if we are late enough to have skipped (at least) one full period,
+ * what we must do is doing more replenishments. Note that, however,
+ * every time we add tot_budget to the budget, we also move next_repl
+ * away by CSCHED2_BDGT_REPL_PERIOD, to make sure the cap is always
+ * respected.
+ */
+if ( unlikely(sdom->next_repl <= now) )
+{
+do
+do_replenish(sdom);
+while ( sdom->next_repl <= now );
+}
Just a bit confused. Have you seen this kind of scenario. Please can you 
explain it.

Is this condition necessary.

+/*
+ * 2) if we overrun by more than tot_budget, then budget+tot_budget is
+ * still < 0, which means that we can't unpark the vCPUs. Let's bail,
+ * and wait for future replenishments.
+ */
+if ( unlikely(sdom->budget <= 0) )
+{
+spin_unlock_irqrestore(>budget_lock, flags);
+goto out;
+}
"if we overran by more than tot_budget in previous run", make is more 
clear..

+
+/* Since we do more replenishments, make sure we didn't overshot. */
+sdom->budget = min(sdom->budget, sdom->tot_budget);
+
+/*
+ * As above, let's prepare the temporary list, out of the domain's
+ * parked_vcpus list, now that we hold the budget_lock. Then, drop such
+ * lock, and pass the list to the unparking function.
+ */
+list_splice_init(>parked_vcpus, );
+
+spin_unlock_irqrestore(>budget_lock, flags);
+
+unpark_parked_vcpus(sdom->dom->cpupool->sched, );
+
+ out:
+set_timer(sdom->repl_timer, sdom->next_repl);
+}
+
  #ifndef NDEBUG
  static inline void
  csched2_vcpu_check(struct vcpu *vc)
@@ -1658,6 +2035,9 @@ csched2_alloc_vdata(const struct scheduler *ops, struct 
vcpu *vc, void *dd)
  }
  svc->tickled_cpu = -1;
  
+


Rest, looks good to me.

Thanks
Anshul





___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 2/7] xen/tools: credit2: soft-affinity awareness in runq_tickle()

2017-06-23 Thread Anshul Makkar

On 16/06/2017 15:13, Dario Faggioli wrote:

Soft-affinity support is usually implemented by means
of a two step "balancing loop", where:
- during the first step, we consider soft-affinity
  (if the vcpu has one);
- during the second (if we get to it), we consider
  hard-affinity.

In runq_tickle(), we need to do that for checking
whether we can execute the waking vCPU on an pCPU
that is idle. In fact, we want to be sure that, if
there is an idle pCPU in the vCPU's soft affinity,
we'll use it.

If there are no such idle pCPUs, though, and we
have to check non-idle ones, we can avoid the loop
and to both hard and soft-affinity in one pass.

In fact, we can we scan runqueue and compute a
"score" for each vCPU which is running on each pCPU.
The idea is, since we may have to preempt someone:
- try to make sure that the waking vCPU will run
  inside its soft-affinity,
- try to preempt someone that is running outside
  of its own soft-affinity.

The value of the score is added to a trace record,
so xenalyze's code and tools/xentrace/formats are
updated accordingly.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
---
Cc: George Dunlap <george.dun...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
Cc: Ian Jackson <ian.jack...@eu.citrix.com>
Cc: Wei Liu <wei.l...@citrix.com>
---


reviewed. Looks good.

Anshul



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/7] xen: sched: factor affinity helpers out of sched_credit.c

2017-06-23 Thread Anshul Makkar

On 16/06/2017 15:13, Dario Faggioli wrote:

In fact, we want to be able to use them from any scheduler.

While there, make the moved code use 'v' for struct_vcpu*
variable, like it should be done everywhere.

No functional change intended.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
Signed-off-by: Justin T. Weaver <jtwea...@hawaii.edu>
Reviewed-by: George Dunlap <george.dun...@citrix.com>
---
Cc: Anshul Makkar <anshul.mak...@citrix.com>
---
 xen/common/sched_credit.c  |   97 +++-
 xen/include/xen/sched-if.h |   64 +
 2 files changed, 79 insertions(+), 82 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index efdf6bf..53773df 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -136,27 +136,6 @@

*

+ * Hard affinity balancing is always necessary and must never be skipped.
+ * But soft affinity need only be considered when it has a functionally
+ * different effect than other constraints (such as hard affinity, cpus
+ * online, or cpupools).
+ *
+ * Soft affinity only needs to be considered if:
+ * * The cpus in the cpupool are not a subset of soft affinity
do you mean cpus in a cpupool are not in a subset of the soft affinity 
of a vcpu which is a runnable state ?

+ * * The hard affinity is not a subset of soft affinity
+ * * There is an overlap between the soft affinity and the mask which is
+ *   currently being considered.
+ */


reviewed !! looks good.

Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap

2017-06-15 Thread Anshul Makkar

On 13/06/2017 22:13, Dario Faggioli wrote:

On Tue, 2017-06-13 at 17:07 +0100, Anshul Makkar wrote:

On 12/06/2017 14:19, Dario Faggioli wrote:

@@ -92,6 +92,82 @@
  */


what I want to ask is that if the budget of the domain is
replenished,
but credit for the vcpus of that domain is not available, then what
happens.


Yes, but the point is that budget can be available or not, while
credits are always available. There's no such thing as credit not being
available at all.

The amount of credits each vcpu has decides which vcpu will run, in the
sense that it will be the one that has the highest amount of credits.
The others will indeed wait, but because they've got less credit than
the one that runs, not because they don't have credits available.

Ok, as discussed, credits are resetted if it reaches 0. In that sense 
its being considered "always available"..



I believe, vcpus won't be scheduled (even if they have budget_quota)
till they get their credit replenished.


Credits are not exhausted or replenished.


But... it's already totally dynamic.


csched2_dom_cntl()
{
svc->budget_quota = max(sdom->tot_budget / sdom->nr_vcpus,
 CSCHED2_MIN_TIMER);
}
If domain->tot_budge = 200
nr_cpus is 4, then each cpu gets 50%.
How this is dynamic allocation ? We are not considering vcpu
utilization
of other vcpus of domain before allocating budget_quota for some
vcpu.


Right. Well, what this means is that each vcpu will get budget in
chunks of tot_budget/nr_vcpus. But then, how much budget each vcpu will
actually be able to get and consume in each period, it's impossible to
know in advance, as it will depend on overall system load, and the
behavior of the various vcpus of the domain.
Yeah, the current implementation is dynamic in the sense that vcpu can 
execute in total more than its quota across multiple budget cycles, but 
its static in the sense that vcpu can't execute more than its budget 
quota in a single budget cycle.s



In runq candidate we have a code base


Regards,
Dario


Reviewed-by: Anshul Makkar<anshul.mak...@citrix.com>
Anshul


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap

2017-06-13 Thread Anshul Makkar

On 12/06/2017 14:19, Dario Faggioli wrote:

Hey,



Budget is burned by the domain's vCPUs, in a similar way to how
credits are.

When a domain runs out of budget, its vCPUs can't run any longer.


if the vcpus of a domain have credit and if budget has run out, will
the
vcpus won't be scheduled.


Is this a question? Assuming it is, what do you mean with "domain have
credit"? Domains always have credits, and they never run out of them.
There's no such thing as a domain not being runnable because it does
not have credits.

"domain have credit" ? the statement is if the "vcpus of domain have 
credit".



About budget, a domain with <= 0 budget means all its vcpus are not
runnable, and hence won't be scheduler, no matter their credits.

You answered the question here.



@@ -92,6 +92,82 @@
  */

 /*
+ * Utilization cap:
+ *
+ * Setting an pCPU utilization cap for a domain means the
following:
+ *
+ * - a domain can have a cap, expressed in terms of % of physical
+ * For implementing this, we use the following approach:
+ *
+ * - each domain is given a 'budget', an each domain has a timer,
which
+ *   replenishes the domain's budget periodically. The budget is
the amount
+ *   of time the vCPUs of the domain can use every 'period';
+ *
+ * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for
all domains
+ *   (but each domain has its own timer; so the all are periodic
by the same
+ *   period, but replenishment of the budgets of the various
domains, at
+ *   periods boundaries, are not synchronous);
+ *
+ * - when vCPUs run, they consume budget. When they don't run,
they don't
+ *   consume budget. If there is no budget left for the domain, no
vCPU of
+ *   that domain can run. If a vCPU tries to run and finds that
there is no
+ *   budget, it blocks.
+ *   Budget never expires, so at whatever time a vCPU wants to
run, it can
+ *   check the domain's budget, and if there is some, it can use
it.
+ *
+ * - budget is replenished to the top of the capacity for the
domain once
+ *   per period. Even if there was some leftover budget from
previous period,
+ *   though, the budget after a replenishment will always be at
most equal
+ *   to the total capacify of the domain ('tot_budget');
+ *


budget is replenished but credits not available ?


Still not getting this.
what I want to ask is that if the budget of the domain is replenished, 
but credit for the vcpus of that domain is not available, then what 
happens.
I believe, vcpus won't be scheduled (even if they have budget_quota) 
till they get their credit replenished.



budget is finished but not vcpu has not reached the rate limit
boundary ?


Budget takes precedence over ratelimiting. This is important to keep
cap working "regularly", rather then in some kind of permanent "trying-
to-keep-up-with-overruns-in-previous-periods" state.

And, ideally, a vcpu cap and ratelimiting should be set in such a way
that they don't step on each other toe (or do that only rarely). I can
see about trying to print a warning when I detect potential tricky
values (but it's not easy, considering budget is per-domain, so I can't
be sure about how much each vcpu will actually get, and whether or not
why you can't be sure. Scheduler know the domain budget, number of vcpus 
per domain and we can calculate the budget_quota and translate it into 
cpu slot duration.
Similarly , the value of rate limit is also known. We can compare and 
give a warning to the user if the budget_quota is less than rate limit.


This is very important for the user to know, if wrongly chosen, it can 
adversely affect the system's performance with frequent context 
switches. (the problem we are aware of).



that will reveal to be significantly less than ratelimiting the most of
the times).


+ * - when a budget replenishment occurs, if there are vCPUs that
had been
+ *   blocked because of lack of budget, they'll be unblocked, and
they will
+ *   (potentially) be able to run again.
+ *
+ * Finally, some even more implementation related detail:
+ *
+ * - budget is stored in a domain-wide pool. vCPUs of the domain
that want
+ *   to run go to such pool, and grub some. When they do so, the
amount
+ *   they grabbed is _immediately_ removed from the pool. This
happens in
+ *   vcpu_try_to_get_budget();
+ *
+ * - when vCPUs stop running, if they've not consumed all the
budget they
+ *   took, the leftover is put back in the pool. This happens in
+ *   vcpu_give_budget_back();


200% budget, 4 vcpus to run on 4 pcpus each allowed only 50% of
budget.
This is a static allocation .


Err... again, are you telling or asking?

giving an example to prove its a static allocation.



 for eg. 2 vcpus running on 2 pvpus at 20%
budgeted time, if vcpu3 wants to execute some cpu intensive task,
then
it won't be allowed to allowed to use more than 50% of the pcpus.


With what parameters? You mean with these ones you cite here (i.e., a
200% budget)? If the VM has 200%, and vcpu1 and vcpu2 consumes
20%+20%=40%, 

Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap

2017-06-12 Thread Anshul Makkar

On 08/06/2017 13:08, Dario Faggioli wrote:

This commit implements the Xen part of the cap mechanism for
Credit2.

A cap is how much, in terms of % of physical CPU time, a domain
can execute at most.

For instance, a domain that must not use more than 1/4 of one
physical CPU, must have a cap of 25%; one that must not use more
than 1+1/2 of physical CPU time, must be given a cap of 150%.

Caps are per domain, so it is all a domain's vCPUs, cumulatively,
that will be forced to execute no more than the decided amount.

This is implemented by giving each domain a 'budget', and using
a (per-domain again) periodic timer. Values of budget and 'period'
are chosen so that budget/period is equal to the cap itself.

Budget is burned by the domain's vCPUs, in a similar way to how
credits are.

When a domain runs out of budget, its vCPUs can't run any longer.
if the vcpus of a domain have credit and if budget has run out, will the 
vcpus won't be scheduled.

They can gain, when the budget is replenishment by the timer, which
event happens once every period.

Blocking the vCPUs because of lack of budget happens by
means of a new (_VPF_parked) pause flag, so that, e.g.,
vcpu_runnable() still works. This is similar to what is
done in sched_rtds.c, as opposed to what happens in
sched_credit.c, where vcpu_pause() and vcpu_unpause()
(which means, among other things, more overhead).

Note that xenalyze and tools/xentrace/format are also modified,
to keep them updated with one modified event.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
---
Cc: George Dunlap <george.dun...@eu.citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
Cc: Andrew Cooper <andrew.coop...@citrix.com>
Cc: Jan Beulich <jbeul...@suse.com>
Cc: Ian Jackson <ian.jack...@eu.citrix.com>
Cc: Wei Liu <wei.l...@citrix.com>
---
 tools/xentrace/formats |2
 tools/xentrace/xenalyze.c  |   10 +
 xen/common/sched_credit2.c |  470 +---
 xen/include/xen/sched.h|3
 4 files changed, 445 insertions(+), 40 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 8b31780..142b0cf 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -51,7 +51,7 @@

 0x00022201  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tick
 0x00022202  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_pos   [ 
dom:vcpu = 0x%(1)08x, pos = %(2)d]
-0x00022203  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit burn[ 
dom:vcpu = 0x%(1)08x, credit = %(2)d, delta = %(3)d ]
+0x00022203  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit burn[ 
dom:vcpu = 0x%(1)08x, credit = %(2)d, budget = %(3)d, delta = %(4)d ]
 0x00022204  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_add
 0x00022205  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle_check   [ 
dom:vcpu = 0x%(1)08x, credit = %(2)d ]
 0x00022206  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle [ cpu = 
%(1)d ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index fa608ad..c16c02d 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7680,12 +7680,14 @@ void sched_process(struct pcpu_info *p)
 if(opt.dump_all) {
 struct {
 unsigned int vcpuid:16, domid:16;
-int credit, delta;
+int credit, budget, delta;
 } *r = (typeof(r))ri->d;

-printf(" %s csched2:burn_credits d%uv%u, credit = %d, delta = 
%d\n",
-   ri->dump_header, r->domid, r->vcpuid,
-   r->credit, r->delta);
+printf(" %s csched2:burn_credits d%uv%u, credit = %d, ",
+   ri->dump_header, r->domid, r->vcpuid, r->credit);
+if ( r->budget != INT_MIN )
+printf("budget = %d, ", r->budget);
+printf("delta = %d\n", r->delta);
 }
 break;
 case TRC_SCHED_CLASS_EVT(CSCHED2, 5): /* TICKLE_CHECK  */
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 126417c..ba4bf4b 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -92,6 +92,82 @@
  */

 /*
+ * Utilization cap:
+ *
+ * Setting an pCPU utilization cap for a domain means the following:
+ *
+ * - a domain can have a cap, expressed in terms of % of physical CPU time.
+ *   A domain that must not use more than 1/4 of _one_ physical CPU, will
+ *   be given a cap of 25%; a domain that must not use more than 1+1/2 of
+ *   physical CPU time, will be given a cap of 150%;
+ *
+ * - caps are per-domain (not per-vCPU). If a domain has only 1 vCPU, and
+ *   a 40% cap, that one vCPU will use 40% of one pCPU. If a somain has 4
+ *   vCPUs, and a 200% cap, all its 4 vCPUs are allowed to run for (the
+ *   equivalent of) 100% time on 2 pCPUs. How much eac

Re: [Xen-devel] DESIGN: CPUID part 3

2017-06-09 Thread Anshul Makkar

On 08/06/2017 14:12, Andrew Cooper wrote:

Presented herewith is the a plan for the final part of CPUID work, which
primarily covers better Xen/Toolstack interaction for configuring the guests
CPUID policy.

A PDF version of this document is available from:

http://xenbits.xen.org/people/andrewcoop/cpuid-part-3.pdf

There are a number of still-open questions, which I would appreaciate views
on.

~Andrew


# Proposal

First and foremost, split the current **max\_policy** notion into separate
**max** and **default** policies.  This allows for the provision of features
which are unused by default, but may be opted in to, both at the hypervisor
level and the toolstack level.

At the hypervisor level, **max** constitutes all the features Xen can use on
the current hardware, while **default** is the subset thereof which are
supported features, the features which the user has explicitly opted in to,
and excluding any features the user has explicitly opted out of.

A new `cpuid=` command line option shall be introduced, whose internals are
generated automatically from the featureset ABI.  This means that all features
added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
line control.  (RFC: The same top level option can probably be used for
non-feature CPUID data control, although I can't currently think of any cases
where this would be used Also find a sensible way to express 'available but
not to be used by Xen', as per the current `smep` and `smap` options.)


At the guest level, **max** constitutes all the features which can be offered
to each type of guest on this hardware.  Derived from Xen's **default**
policy, it includes the supported features and explicitly opted in to
features, which are appropriate for the guest.

The guests **default** policy is then derived from its **max**, and includes
the supported features which are considered migration safe.  (RFC: This
distinction is rather fuzzy, but for example it wouldn't include things like
ITSC by default, as that is likely to go wrong unless special care is taken.)

Just from other perspective, what happens to the features which have 
been explicilty selected and are not migration safe ? Do, we consider 
them in guest's default policy.



All global policies (Xen and guest, max and default) shall be made available
to the toolstack, in a manner similar to the existing
Instead of all, do you see any harm if we expose only the default 
policies of Xen and Guest to toolstack.

_XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
taken which include all CPUID data, not just the feature bitmaps.

New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
which allows the toolstack to query and set the cpuid policy for a specific
domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, shall fail if Xen is
unhappy with any aspect of the policy during auditing.

When a domain is initially created, the appropriate guests **default** policy
is duplicated for use.  When auditing, Xen shall audit the toolstacks
requested policy against the guests **max** policy.  This allows experimental
features or non-migration-safe features to be opted in to, without those
features being imposed upon all guests automatically.




The `disable_migrate` field shall be dropped.  The concept of migrateability
is not boolean; it is a large spectrum, all of which needs to be managed by
the toolstack.
Can't this large spectrum result in a bool which can then be used for 
disable_migrate. Sorry, I can't see any value add in removing 
disable_migrate.

 The simple case is picking the common subset of features

between the source and destination.  This becomes more complicated e.g. if the
guest uses LBR/LER, at which point the toolstack needs to consider hardware
with the same LBR/LER format in addition to just the plain features.

`disable_migrate` is currently only used to expose ITSC to guests, but there
are cases where is perfectly safe to migrate such a guest, if the destination
host has the same TSC frequency or hardware TSC scaling support.

Finally, `disable_migrate` doesn't (and cannot reasonably) be used to inhibit
state gather operations, as this interferes with debugging and monitoring
tasks.


Thanks
Anshul


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [[PATCH -v2]] XenBus: Don't wait for the producer to fill the ring if

2017-05-23 Thread Anshul Makkar
The condition: if there is a space in the ring then wait for the producer
to fill the ring also evaluates to true even if the ring if full. It
leads to a deadlock where producer is waiting for consumer
to consume the items and consumer is waiting for producer to fill the ring.

Fix for the issue: check if the ring is full and then break from
the loop to consume the items from the ring.
eg. case: prod = 1272, cons = 248.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
v2:
 * resolved the coding style issue.
 * modified the "if" condition statement to make it simpler.

 tools/firmware/hvmloader/xenbus.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/tools/firmware/hvmloader/xenbus.c 
b/tools/firmware/hvmloader/xenbus.c
index 448157d..2b89a56 100644
--- a/tools/firmware/hvmloader/xenbus.c
+++ b/tools/firmware/hvmloader/xenbus.c
@@ -141,7 +141,19 @@ static void ring_read(char *data, uint32_t len)
 /* Don't overrun the producer pointer */
 while ( (part = MASK_XENSTORE_IDX(rings->rsp_prod -
   rings->rsp_cons)) == 0 )
+{
+/*
+ * Don't wait for producer to fill the ring if it is already full.
+ * Condition happens when you write string > 1K into the ring.
+ * eg case prod=1272 cons=248.
+ */
+if ( rings->rsp_prod - rings->rsp_cons == XENSTORE_RING_SIZE )
+{
+part = XENSTORE_RING_SIZE;
+break;
+}
 ring_wait();
+}
 /* Don't overrun the end of the ring */
 if ( part > (XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons)) )
 part = XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons);
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] XenBus: Don't wait for producer to fill the ring if the ring

2017-05-17 Thread Anshul Makkar

On 17/05/2017 16:56, Jan Beulich wrote:

On 17.05.17 at 16:57, <anshul.mak...@citrix.com> wrote:

The condition to check for if there is space in the ring buffer
also becomes true if the buffer is full, thus consumer waits for
the producer to fill the buffer eventhough it is already full.

To resolve the situation, check if the buffer is full and then
break from the loop.
e.g case: prod = 1272, cons = 248.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>


Please avoid indenting the entire commit message.
Oh. sorry about that. I saw this format in few previous commits and 
adopted it. Will correct it.



--- a/tools/firmware/hvmloader/xenbus.c
+++ b/tools/firmware/hvmloader/xenbus.c
@@ -141,7 +141,18 @@ static void ring_read(char *data, uint32_t len)
 /* Don't overrun the producer pointer */
 while ( (part = MASK_XENSTORE_IDX(rings->rsp_prod -
   rings->rsp_cons)) == 0 )
+{
+/* don't wait for producer to fill the ring if it is already full.
+ * Condition happens when you write string > 1K into the ring.
+ * eg case prod=1272 cons=248.
+ */


Comment style.


+if ( !(XENSTORE_RING_SIZE - (rings->rsp_prod - rings->rsp_cons)) )


Is this any different from

   if ( rings->rsp_prod - rings->rsp_cons == XENSTORE_RING_SIZE  )

No, its same. Ok, will use the suggested approach.


?

Jan




___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] XenBus: Don't wait for producer to fill the ring if the ring

2017-05-17 Thread Anshul Makkar
The condition to check for if there is space in the ring buffer
also becomes true if the buffer is full, thus consumer waits for
the producer to fill the buffer eventhough it is already full.

To resolve the situation, check if the buffer is full and then
break from the loop.
e.g case: prod = 1272, cons = 248.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
 tools/firmware/hvmloader/xenbus.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/firmware/hvmloader/xenbus.c 
b/tools/firmware/hvmloader/xenbus.c
index 448157d..f8fd730 100644
--- a/tools/firmware/hvmloader/xenbus.c
+++ b/tools/firmware/hvmloader/xenbus.c
@@ -141,7 +141,18 @@ static void ring_read(char *data, uint32_t len)
 /* Don't overrun the producer pointer */
 while ( (part = MASK_XENSTORE_IDX(rings->rsp_prod -
   rings->rsp_cons)) == 0 )
+{
+/* don't wait for producer to fill the ring if it is already full.
+ * Condition happens when you write string > 1K into the ring.
+ * eg case prod=1272 cons=248.
+ */
+if ( !(XENSTORE_RING_SIZE - (rings->rsp_prod - rings->rsp_cons)) )
+{
+part = XENSTORE_RING_SIZE;
+break;
+}
 ring_wait();
+}
 /* Don't overrun the end of the ring */
 if ( part > (XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons)) )
 part = XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons);
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal().

2017-03-03 Thread anshul makkar



On 02/03/17 10:38, Dario Faggioli wrote:

Chacking whether or not a vCPU can be 'stolen'
from a peer pCPU's runqueue is relatively cheap.

Therefore, let's do that  as early as possible,
avoiding potentially useless complex checks, and
cpumask manipulations.

Signed-off-by: Dario Faggioli 
---
Cc: George Dunlap 
---
  xen/common/sched_credit.c |   17 +
  1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 63a8675..2b13e99 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -708,12 +708,10 @@ static inline int
  __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
  {
  /*
- * Don't pick up work that's in the peer's scheduling tail or hot on
- * peer PCPU. Only pick up work that prefers and/or is allowed to run
- * on our CPU.
+ * Don't pick up work that's or hot on peer PCPU, or that can't (or

Not clear.

+ * would prefer not to) run on cpu.
   */
-return !vc->is_running &&
-   !__csched_vcpu_is_cache_hot(vc) &&
+return !__csched_vcpu_is_cache_hot(vc) &&
 cpumask_test_cpu(dest_cpu, mask);
!vc->is_running doesn't ease the complexity and doesn't save much on cpu 
cycles. Infact, I think (!vc->is_running) makes the intention to check 
for migration much more clear to understand.


Yeah, apart from the above reasons, its save to remove this check from here.

  }
  
@@ -1622,7 +1620,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)

  BUG_ON( is_idle_vcpu(vc) );
  
  /*

- * If the vcpu has no useful soft affinity, skip this vcpu.
+ * If the vcpu is still in peer_cpu's scheduling tail, or if it
+ * has no useful soft affinity, skip it.
+ *
   * In fact, what we want is to check if we have any "soft-affine
   * work" to steal, before starting to look at "hard-affine work".
   *
@@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int 
balance_step)
   * vCPUs with useful soft affinities in some sort of bitmap
   * or counter.
   */
-if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
- && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+if ( vc->is_running ||
+ (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+  && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) )
  continue;
  
  csched_balance_cpumask(vc, balance_step, cpumask_scratch);



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit.

2017-03-03 Thread anshul makkar



On 02/03/17 10:38, Dario Faggioli wrote:

Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it in an
ASSERT()).

And since we try to steal only from busy pCPUs, it's
unlikely for such pCPU to be idle, so we mark it as
such (and bail early if it unfortunately is).

Signed-off-by: Dario Faggioli 
---
Cc: George Dunlap 
---
  xen/common/sched_credit.c |   87 +++--
  1 file changed, 44 insertions(+), 43 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 4649e64..63a8675 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1593,64 +1593,65 @@ static struct csched_vcpu *
  csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
  {
  const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
-const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
  struct csched_vcpu *speer;
  struct list_head *iter;
  struct vcpu *vc;
  
+ASSERT(peer_pcpu != NULL);

+
  /*
   * Don't steal from an idle CPU's runq because it's about to
   * pick up work from it itself.
   */
-if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
+if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
+goto out;
We can just use if (!is_idle_vcpu(peer_vcpu)). Why to replace it with 
some code that introduces an unnecessary branch statement.

+
+list_for_each( iter, _pcpu->runq )
  {
-list_for_each( iter, _pcpu->runq )
-{
-speer = __runq_elem(iter);
+speer = __runq_elem(iter);
  
-/*

- * If next available VCPU here is not of strictly higher
- * priority than ours, this PCPU is useless to us.
- */
-if ( speer->pri <= pri )
-break;
+/*
+ * If next available VCPU here is not of strictly higher
+ * priority than ours, this PCPU is useless to us.
+ */
+if ( speer->pri <= pri )
+break;
  
-/* Is this VCPU runnable on our PCPU? */

-vc = speer->vcpu;
-BUG_ON( is_idle_vcpu(vc) );
+/* Is this VCPU runnable on our PCPU? */
+vc = speer->vcpu;
+BUG_ON( is_idle_vcpu(vc) );
  
-/*

- * If the vcpu has no useful soft affinity, skip this vcpu.
- * In fact, what we want is to check if we have any "soft-affine
- * work" to steal, before starting to look at "hard-affine work".
- *
- * Notice that, if not even one vCPU on this runq has a useful
- * soft affinity, we could have avoid considering this runq for
- * a soft balancing step in the first place. This, for instance,
- * can be implemented by taking note of on what runq there are
- * vCPUs with useful soft affinities in some sort of bitmap
- * or counter.
- */
-if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
- && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
-continue;
+/*
+ * If the vcpu has no useful soft affinity, skip this vcpu.
+ * In fact, what we want is to check if we have any "soft-affine
+ * work" to steal, before starting to look at "hard-affine work".
+ *
+ * Notice that, if not even one vCPU on this runq has a useful
+ * soft affinity, we could have avoid considering this runq for
+ * a soft balancing step in the first place. This, for instance,
+ * can be implemented by taking note of on what runq there are
+ * vCPUs with useful soft affinities in some sort of bitmap
+ * or counter.
+ */
Isn't it a better approach that now as we have came across a vcpu which 
doesn't have a desired soft affinity but is a potential candidate for 
migration, so instead of just forgetting it,  lets save the information 
for such vcpus in some data structure in some order so that we don't 
have to scan them again if we don't find anything useful in the present run.

+if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+ && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+continue;
  
-csched_balance_cpumask(vc, balance_step, cpumask_scratch);

-if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
-{
-/* We got a candidate. Grab it! */
-TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
- vc->domain->domain_id, vc->vcpu_id);
-SCHED_VCPU_STAT_CRANK(speer, migrate_q);
-SCHED_STAT_CRANK(migrate_queued);
-WARN_ON(vc->is_urgent);
-__runq_remove(speer);
-vc->processor = 

Re: [Xen-devel] XenGT GPU virtualization

2017-02-20 Thread anshul makkar



On 18/01/17 13:21, bharat gohil wrote:

Hello

I am new to GPU and GPU virtualization and found that xen support 
intel GPU virtualization using XenGT.

I want to know,
1) What are the critical GPU command pass from xen to Dom0?
2) How the Dom0 mediator or xen validate the GPU command which is 
passed from domU GPU driver?
3) If one of the domU guest send bad(malicious) command to GPU which 
led GPU to bad state. Can Dom0 mediator or xen prevents this kind of 
scenario?
As far as I know, there is know mediation to check for the commands. Xen 
does audit the target address space, but not GPU commands.


--
Regards,
Bharat Gohil



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.

2017-02-13 Thread anshul makkar

Hi Jan,


On 09/02/17 16:22, Jan Beulich wrote:

On 08.02.17 at 16:32,  wrote:

On 07.02.17 at 18:26,  wrote:

Facing a issue where bootstorm of guests leads to host crash. I debugged
and found that that enabling PML  introduces a  race condition during
guest teardown stage while disabling PML on a vcpu  and context switch
happening for another vcpu.

Crash happens only on Broadwell processors as PML got introduced in this
generation.

Here is my analysis:

Race condition:

vmcs.c vmx_vcpu_disable_pml (vcpu){ vmx_vmcs_enter() ; vm_write(
disable_PML); vmx_vmcx_exit();)

In between I have a callpath from another pcpu executing context
switch-> vmx_fpu_leave() and crash on vmwrite..

if ( !(v->arch.hvm_vmx.host_cr0 & X86_CR0_TS) )
{
   v->arch.hvm_vmx.host_cr0 |= X86_CR0_TS;
   __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);  //crash
   }

So that's after current has changed already, so it's effectively
dealing with a foreign VMCS, but it doesn't use vmx_vmcs_enter().
The locking done in vmx_vmcs_try_enter() / vmx_vmcs_exit(),
however, assumes that any user of a VMCS either owns the lock
or has current as the owner of the VMCS. Of course such a call
also can't be added here, as a vcpu on the context-switch-from
path can't vcpu_pause() itself.

That taken together with the bypassing of __context_switch()
when the incoming vCPU is the idle one (which means that via
context_saved() ->is_running will be cleared before running
->ctxt_switch_from()), the vcpu_pause() invocation in
vmx_vmcs_try_enter() may not have to wait at all if the call
happens between the clearing of ->is_running and the
eventual invocation of vmx_ctxt_switch_from().

Mind giving the attached patch a try (which admittedly was only
lightly tested so far - in particular I haven't seen the second of
the debug messages being logged, yet)?
Patch looks promising. I couldn't do much thorough testing, but initial 
reboot cycles (around 20 reboots of 32 VMS) went fine.


Jan


Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.

2017-02-08 Thread anshul makkar
May be "causes". But not sure, as disabling this feature solves the 
issue (no hypervisor crash) and we have reports that the same code base 
works fine on Haswell (again no hardware support for PML in Haswell, but 
code base is same. But inherently it leads to non-execution of PML code 
path, so no crash)


Anshul


On 08/02/17 15:37, Jan Beulich wrote:

On 08.02.17 at 15:58,  wrote:

Code is trying to destroy multiple vcpus held by the domain:
complete_domain_destroy->hvm_vcpu_destroy() for each vcpu.

In vmx_vcpu_destroy, we have a call for vmx_vcpu_disable_pml which leads
to a race while destroying foreign vcpu's with other domains rebooting
on the same vcpus .

With a single domain reboot, no race is observed.

Commit e18d4274772e52ac81fda1acb246d11ef666e5fe causes this race condition.

Causes or exposes? I ask because the similar reports we have are
all on 4.4.x, i.e. no PML code there at all.

Jan




___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.

2017-02-08 Thread anshul makkar

Hi,

Code is trying to destroy multiple vcpus held by the domain: 
complete_domain_destroy->hvm_vcpu_destroy() for each vcpu.


In vmx_vcpu_destroy, we have a call for vmx_vcpu_disable_pml which leads 
to a race while destroying foreign vcpu's with other domains rebooting 
on the same vcpus .


With a single domain reboot, no race is observed.

Commit e18d4274772e52ac81fda1acb246d11ef666e5fe causes this race condition.

Anshul


On 07/02/17 17:58, anshul makkar wrote:

Hi, Sorry, forgot to include you in cc list.

Anshul


On 07/02/17 17:26, anshul makkar wrote:

Hi,

Facing a issue where bootstorm of guests leads to host crash. I 
debugged and found that that enabling PML  introduces a  race 
condition during guest teardown stage while disabling PML on a vcpu  
and context switch happening for another vcpu.


Crash happens only on Broadwell processors as PML got introduced in 
this generation.


Here is my analysis:

Race condition:

vmcs.c vmx_vcpu_disable_pml (vcpu){ vmx_vmcs_enter() ; vm_write( 
disable_PML); vmx_vmcx_exit();)


In between I have a callpath from another pcpu executing context 
switch-> vmx_fpu_leave() and crash on vmwrite..


  if ( !(v->arch.hvm_vmx.host_cr0 & X86_CR0_TS) )
{
 v->arch.hvm_vmx.host_cr0 |= X86_CR0_TS;
 __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); //crash
 }

Debug logs
XEN) [221256.749928] VMWRITE VMCS Invalid !
(XEN) [221256.754870] **[00] { now c93b4341df1d, hw 
0035fffea000, op 0035fffea000 } vmclear
(XEN) [221256.765052] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221256.773969] **[01] { now c93b4341e099, hw 
, op 0035fffea000 } vmptrld
(XEN) [221256.784150] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221256.792197] **[02] { now c93b4341e1f1, hw 
0035fffea000, op 0035fffea000 } vmclear
(XEN) [221256.802378] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221256.811298] **[03] { now c93b5784dd0a, hw 
, op 0039d7074000 } vmptrld
(XEN) [221256.821478] ** frames [ 82d0801f4c31 
vmx_do_resume+0x51/0x150 ]


(XEN) [221256.829139] **[04] { now c93b59d67b5b, hw 
0039d7074000, op 002b9a575000 } vmptrld
(XEN) [221256.839320] ** frames [ 82d0801f4c31 
vmx_do_resume+0x51/0x150 ]


(XEN) [221256.882850] **[07] { now c93b59e71e48, hw 
002b9a575000, op 0039d7074000 } vmptrld
(XEN) [221256.893034] ** frames [ 82d0801f4d13 
vmx_do_resume+0x133/0x150 ]


(XEN) [221256.900790] **[08] { now c93b59e78675, hw 
0039d7074000, op 0040077ae000 } vmptrld
(XEN) [221256.910968] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221256.919015] **[09] { now c93b59e78ac8, hw 
0040077ae000, op 0040077ae000 } vmclear
(XEN) [221256.929196] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221256.938117] **[10] { now c93b59e78d72, hw 
, op 0040077ae000 } vmptrld
(XEN) [221256.948297] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221256.956345] **[11] { now c93b59e78ff0, hw 
0040077ae000, op 0040077ae000 } vmclear
(XEN) [221256.966525] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221256.975445] **[12] { now c93b59e7deda, hw 
, op 0040077b3000 } vmptrld
(XEN) [221256.985626] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221256.993672] **[13] { now c93b59e9fe00, hw 
0040077b3000, op 0040077b3000 } vmclear
(XEN) [221257.003852] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221257.012772] **[14] { now c93b59ea007e, hw 
, op 0040077b3000 } vmptrld
(XEN) [221257.022952] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221257.031000] **[15] { now c93b59ea02ba, hw 
0040077b3000, op 0040077b3000 } vmclear
(XEN) [221257.041180] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221257.050101]  
(XEN) [221257.053008] vmcs_ptr:0x, 
vcpu->vmcs:0x2b9a575000



vmcs is loaded and between the next call to vm_write, there is a 
clear of vmcs caused by vmx_vcpu_disable_pml.


Above log highlights that IPI is clearing the vmcs in between vmptrld 
and vmwrite  but I also verified that interrupts are disabled during 
context switch and execution of vm_write in vmx_fpu_leave.. This has 
got me confused.


Also, I am not sure if I understand the handling of foreign_vmcs 
correctly, which can also be the cause of the race.


Please if you can share some suggestions here.


Thanks

Anshul Makkar







___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel





___
Xen-devel mailing list
Xen-devel@lists.xen.org

[Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.

2017-02-07 Thread anshul makkar

Hi,

Facing a issue where bootstorm of guests leads to host crash. I debugged 
and found that that enabling PML  introduces a  race condition during 
guest teardown stage while disabling PML on a vcpu  and context switch 
happening for another vcpu.


Crash happens only on Broadwell processors as PML got introduced in this 
generation.


Here is my analysis:

Race condition:

vmcs.c vmx_vcpu_disable_pml (vcpu){ vmx_vmcs_enter() ; vm_write( 
disable_PML); vmx_vmcx_exit();)


In between I have a callpath from another pcpu executing context 
switch-> vmx_fpu_leave() and crash on vmwrite..


  if ( !(v->arch.hvm_vmx.host_cr0 & X86_CR0_TS) )
{
 v->arch.hvm_vmx.host_cr0 |= X86_CR0_TS;
 __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);  //crash
 }

Debug logs
XEN) [221256.749928] VMWRITE VMCS Invalid !
(XEN) [221256.754870] **[00] { now c93b4341df1d, hw 
0035fffea000, op 0035fffea000 } vmclear
(XEN) [221256.765052] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221256.773969] **[01] { now c93b4341e099, hw 
, op 0035fffea000 } vmptrld
(XEN) [221256.784150] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221256.792197] **[02] { now c93b4341e1f1, hw 
0035fffea000, op 0035fffea000 } vmclear
(XEN) [221256.802378] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221256.811298] **[03] { now c93b5784dd0a, hw 
, op 0039d7074000 } vmptrld
(XEN) [221256.821478] ** frames [ 82d0801f4c31 
vmx_do_resume+0x51/0x150 ]


(XEN) [221256.829139] **[04] { now c93b59d67b5b, hw 
0039d7074000, op 002b9a575000 } vmptrld
(XEN) [221256.839320] ** frames [ 82d0801f4c31 
vmx_do_resume+0x51/0x150 ]


(XEN) [221256.882850] **[07] { now c93b59e71e48, hw 
002b9a575000, op 0039d7074000 } vmptrld
(XEN) [221256.893034] ** frames [ 82d0801f4d13 
vmx_do_resume+0x133/0x150 ]


(XEN) [221256.900790] **[08] { now c93b59e78675, hw 
0039d7074000, op 0040077ae000 } vmptrld
(XEN) [221256.910968] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221256.919015] **[09] { now c93b59e78ac8, hw 
0040077ae000, op 0040077ae000 } vmclear
(XEN) [221256.929196] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221256.938117] **[10] { now c93b59e78d72, hw 
, op 0040077ae000 } vmptrld
(XEN) [221256.948297] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221256.956345] **[11] { now c93b59e78ff0, hw 
0040077ae000, op 0040077ae000 } vmclear
(XEN) [221256.966525] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221256.975445] **[12] { now c93b59e7deda, hw 
, op 0040077b3000 } vmptrld
(XEN) [221256.985626] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221256.993672] **[13] { now c93b59e9fe00, hw 
0040077b3000, op 0040077b3000 } vmclear
(XEN) [221257.003852] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221257.012772] **[14] { now c93b59ea007e, hw 
, op 0040077b3000 } vmptrld
(XEN) [221257.022952] ** frames [ 82d0801f0765 
vmx_vmcs_try_enter+0x95/0xb0 ]


(XEN) [221257.031000] **[15] { now c93b59ea02ba, hw 
0040077b3000, op 0040077b3000 } vmclear
(XEN) [221257.041180] ** frames [ 82d080134652 
smp_call_function_interrupt+0x92/0xa0 ]


(XEN) [221257.050101]  
(XEN) [221257.053008] vmcs_ptr:0x, vcpu->vmcs:0x2b9a575000


vmcs is loaded and between the next call to vm_write, there is a clear 
of vmcs caused by vmx_vcpu_disable_pml.


Above log highlights that IPI is clearing the vmcs in between vmptrld 
and vmwrite  but I also verified that interrupts are disabled during 
context switch and execution of vm_write in vmx_fpu_leave.. This has got 
me confused.


Also, I am not sure if I understand the handling of foreign_vmcs 
correctly, which can also be the cause of the race.


Please if you can share some suggestions here.


Thanks

Anshul Makkar







___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] x86/VMX: Implement vmptrst

2017-01-11 Thread anshul makkar



On 11/01/17 05:37, Tian, Kevin wrote:

From: Anshul Makkar [mailto:anshul.mak...@citrix.com]
Sent: Friday, January 06, 2017 2:42 AM

Current codebase doesn't implement and use vmptrst. Implementing it as it may
be required in future.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>

Then let's do it when it's really required. :-)


I needed it to debug an issue, found it missing and that's why I added.  
I have a sanity test, that continuously uses it.  There may be other 
users out there who can use it for similar or some other purpose. Having 
it in standard code base will do no harm.


Thanks
Kevin

Anshul


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] x86/VMX: Implement vmptrst

2017-01-05 Thread Anshul Makkar
Current codebase doesn't implement and use vmptrst. Implementing it as it may
be required in future.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
 xen/include/asm-x86/hvm/vmx/vmx.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h 
b/xen/include/asm-x86/hvm/vmx/vmx.h
index e5c6499..2db6c1d 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -328,6 +328,28 @@ static always_inline void __vmptrld(u64 addr)
: "memory");
 }
 
+static always_inline u64 __vmptrst(void)
+{
+u64 paddr;
+
+asm volatile (
+#ifdef HAVE_GAS_VMX
+   "vmptrst %0\n"
+#else
+   VMPTRST_OPCODE MODRM_EAX_07
+#endif
+
+#ifdef HAVE_GAS_VMX
+   : "=m" (paddr)
+   :
+#else
+   :
+   : "a" (),
+#endif
+   : "memory");
+return paddr;
+}
+
 static always_inline void __vmpclear(u64 addr)
 {
 asm volatile (
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.8 + Linux 4.9 + Credit2 = can't bootup

2017-01-05 Thread Anshul Makkar
On 05/01/2017 08:39, Dario Faggioli wrote:
> On Thu, 2017-01-05 at 02:05 +, Andrew Cooper wrote:
>> On 05/01/2017 01:52, Konrad Rzeszutek Wilk wrote:
>>> It works just fine with credit1 (see further down the log)
>>> but if I try credit2 it ends up hanging during bootup.
>>>
>>> I am a going to naively assume it is due to how the vCPUs are
>>> exposed (Where they match the physical CPUs under credit1),
>>> but under credit2 they are different.
>>>
>>> The dom0_max_vcpus does not seem to have any affect. When I remove
>>> it
>>> things are still being problematic.
>>>
>>> Help!?
>>
>> This matches the symptoms seen by XenServer when trying to stress
>> 32vcpu
>> guests under Credit2.  Malcolm did find (based on interpreted iperf
>> throughput graphs) that Credit2 did seem to preferentially schedule
>> the
>> lower-number vcpus, rather than scheduling them evenly.
>>
> To be fair (and just for the records, since the cause seems actually to
> be something else), this was with an old version (at least two Xen
> releases ago, IIRC, certainly not 4.8) and known to be buggy version of
> Credit2.
>
> We have other tests and benchmarks, done on equally big machines which
> proves the scheduler is 100% functional.
>
Yes, I have done extensive stress tests on XenSever(IIRC Xen - 4.7) with 
Credit2 on 32 or more VCPU guests. Didn't see any hangups.

Malcolm, did find hangup and crash issues with 32VCPU guests, but the 
issue was highlighted and fixed. So, during my testing I specifically 
foucussed on this scenario and found no problem.

> Regards,
> Dario
>

Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xsm: allow relevant permission during migrate and gpu-passthrough.

2017-01-04 Thread anshul makkar



On 03/01/17 18:20, Daniel De Graaf wrote:

On 12/19/2016 11:03 PM, Doug Goldstein wrote:

On 12/19/16 10:02 AM, Doug Goldstein wrote:

On 12/14/16 3:09 PM, Daniel De Graaf wrote:

On 12/12/2016 09:00 AM, Anshul Makkar wrote:

During guest migrate allow permission to prevent
spurious page faults.
Prevents these errors:
d73: Non-privileged (73) attempt to map I/O space 

avc: denied  { set_misc_info } for domid=0 target=11
scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=domain

GPU passthrough for hvm guest:
avc:  denied  { send_irq } for domid=0 target=10
scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=hvm

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>


Acked-by: Daniel De Graaf <dgde...@tycho.nsa.gov>



Daniel,

Should this be backported to 4.8?



Yes, I would consider this a candidate for backporting.


FWIW, Daniel's email is bouncing. Anshul, do you want to test/confirm?


I believe this is fixed now; my email server was changed while I was gone
for the holiday and apparently the change was not tested properly.


Please backport the patch to stable-4.8. I have tested it.

Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Future support of 5-level paging in Xen

2017-01-03 Thread anshul makkar



On 08/12/16 23:40, Boris Ostrovsky wrote:



On 12/08/2016 05:21 PM, Andrew Cooper wrote:

On 08/12/2016 19:18, Stefano Stabellini wrote:





Of course even the largest virtual machine today (2TB on Amazon AFAIK)
is not close to reaching the current memory limit, but it's just a
matter of time.


/me things Oracle will have something to say about this.  I'm sure there
was talk about VMs larger than this at previous hackathons. XenServer
functions (ish, so long as you don't migrate) with 6TB VMs, although
starting and shutting them down feels like treacle.




I've been working (on and off) with SGI to get one of their 32TB boxes 
to boot and I don't think that works. We've fixed a couple of bugs but 
I don't think Xen can boot with that much memory. We successfully 
booted with just under 8TB but couldn't do it with the full system. 
The machine has been taken from us for now so this work is on hold.


This is on OVM, which is 4.4-based, we haven't tried (IIRC) latest bits.

(BTW, speaking of slow starting and shutting down very large guests ---
have you or anyone else had a chance to look at this? My investigation 
initially pointed to scrubbing and then to an insane number of 
hypercall preemptions in relinquish_memory()).


I had a quick look at it when I was working on support for large guest 
and found that scrubbing was indeed one of the issue. Just haven't got 
time to look at it in more details. Hopefully in near future, might work 
on it.



-boris

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xsm: allow relevant permission during migrate and gpu-passthrough.

2016-12-20 Thread Anshul Makkar

On 20/12/2016 04:03, Doug Goldstein wrote:

On 12/19/16 10:02 AM, Doug Goldstein wrote:

On 12/14/16 3:09 PM, Daniel De Graaf wrote:

On 12/12/2016 09:00 AM, Anshul Makkar wrote:

During guest migrate allow permission to prevent
spurious page faults.
Prevents these errors:
d73: Non-privileged (73) attempt to map I/O space 

avc: denied  { set_misc_info } for domid=0 target=11
scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=domain

GPU passthrough for hvm guest:
avc:  denied  { send_irq } for domid=0 target=10
scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=hvm

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>


Acked-by: Daniel De Graaf <dgde...@tycho.nsa.gov>



Daniel,

Should this be backported to 4.8?



FWIW, Daniel's email is bouncing. Anshul, do you want to test/confirm?



Doug, yes, will backport and test.

Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] xsm: allow relevant permission during migrate and gpu-passthrough.

2016-12-12 Thread Anshul Makkar
During guest migrate allow permission to prevent
spurious page faults.
Prevents these errors:
d73: Non-privileged (73) attempt to map I/O space 

avc: denied  { set_misc_info } for domid=0 target=11
scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=domain

GPU passthrough for hvm guest:
avc:  denied  { send_irq } for domid=0 target=10
scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=hvm

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---

 tools/flask/policy/modules/xen.if |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/flask/policy/modules/xen.if 
b/tools/flask/policy/modules/xen.if
index eb646f5..1aca75d 100644
--- a/tools/flask/policy/modules/xen.if
+++ b/tools/flask/policy/modules/xen.if
@@ -49,7 +49,7 @@ define(`create_domain_common', `
allow $1 $2:domain { create max_vcpus setdomainmaxmem setaddrsize
getdomaininfo hypercall setvcpucontext getscheduler
getvcpuinfo getaddrsize getaffinity setaffinity
-   settime setdomainhandle getvcpucontext };
+   settime setdomainhandle getvcpucontext set_misc_info };
allow $1 $2:domain2 { set_cpuid settsc setscheduler setclaim
set_max_evtchn set_vnumainfo get_vnumainfo cacheflush
psr_cmt_op psr_cat_op soft_reset };
@@ -58,7 +58,7 @@ define(`create_domain_common', `
allow $1 $2:mmu { map_read map_write adjust memorymap physmap pinpage 
mmuext_op updatemp };
allow $1 $2:grant setup;
allow $1 $2:hvm { cacheattr getparam hvmctl irqlevel pciroute sethvmc
-   setparam pcilevel trackdirtyvram nested altp2mhvm 
altp2mhvm_op };
+   setparam pcilevel trackdirtyvram nested altp2mhvm 
altp2mhvm_op send_irq };
 ')
 
 # create_domain(priv, target)
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] xsm support for restrictive policy and dynamic modification

2016-12-06 Thread anshul makkar

Hi,

In Xen-Summit, Toronto, I did a presentation on XSM where two 
shortcomings with the present implementation were highlighted:


1) Allow more finer grain control. e.g all introspection domain doesn't 
need to have pci-passthrough capability while few need to have and all 
should have introspection capability.


2) Need for dynamic modification, loading/unloading of the policy.

In response to the above I got few suggestion from the community (Thanks 
Daniel De Graf and Stephen Smalley and others) and after researching the 
alternatives I found that already existent XSM support for MCS and MLS 
and type bound can fit in for finer grain control.


Regarding dynamic support, its something that is still desired and not 
implemented. Stephan shared few links related to these:


this:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.7129=rep1=pdf

http://oss.tresys.com/archive/policy-server.php

Going forward, I will be focusing towards dynamic implementation as I 
believe currently existing infrastructure is sufficient enough to 
provide finer level of control over domains.


Thanks
Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2 04/10] xen: credit2: only reset credit on reset condition

2016-09-30 Thread anshul makkar

On 30/09/16 03:53, Dario Faggioli wrote:

The condition for a Credit2 scheduling epoch coming to an
end is that the vcpu at the front of the runqueue has negative
credits. However, it is possible, that runq_candidate() does
not actually return to the scheduler the first vcpu in the
runqueue (e.g., because such vcpu can't run on the cpu that
is going through the scheduler, because of hard-affinity).

If that happens, we should not trigger a credit reset, or we
risk altering the lenght of a scheduler epoch, wrt what the
original idea of the algorithm was.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
---
Cc: George Dunlap <george.dun...@eu.citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
---

 /*
  * If a vcpu is meant to be picked up by another processor, and such
@@ -2282,6 +2288,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
 if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu &&
  cpumask_test_cpu(svc->tickled_cpu, >tickled) )
 {
+(*skipped)++;
 SCHED_STAT_CRANK(deferred_to_tickled_cpu);
 continue;
 }
@@ -2291,6 +2298,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
 if ( svc->vcpu->processor != cpu
  && snext->credit + CSCHED2_MIGRATE_RESIST > svc->credit )
 {
+(*skipped)++;
 SCHED_STAT_CRANK(migrate_resisted);
 continue;
 }
@@ -2308,11 +2316,12 @@ runq_candidate(struct csched2_runqueue_data *rqd,
 {
 struct {
 unsigned vcpu:16, dom:16;
-unsigned tickled_cpu;
+unsigned tickled_cpu, skipped;
 } d;
 d.dom = snext->vcpu->domain->domain_id;
 d.vcpu = snext->vcpu->vcpu_id;
 d.tickled_cpu = snext->tickled_cpu;
+d.skipped = *skipped;
 __trace_var(TRC_CSCHED2_RUNQ_CANDIDATE, 1,
 sizeof(d),
 (unsigned char *));
@@ -2336,6 +2345,7 @@ csched2_schedule(
 struct csched2_runqueue_data *rqd;
 struct csched2_vcpu * const scurr = CSCHED2_VCPU(current);
 struct csched2_vcpu *snext = NULL;
+unsigned int skipped_vcpus = 0;
 struct task_slice ret;

 SCHED_STAT_CRANK(schedule);
@@ -2385,7 +2395,7 @@ csched2_schedule(
 snext = CSCHED2_VCPU(idle_vcpu[cpu]);
 }
 else
-snext = runq_candidate(rqd, scurr, cpu, now);
+snext = runq_candidate(rqd, scurr, cpu, now, _vcpus);

 /* If switching from a non-idle runnable vcpu, put it
  * back on the runqueue. */
@@ -2409,8 +2419,21 @@ csched2_schedule(
 __set_bit(__CSFLAG_scheduled, >flags);
 }

-/* Check for the reset condition */
-if ( snext->credit <= CSCHED2_CREDIT_RESET )
+/*
+ * The reset condition is "has a scheduler epoch come to an end?".
+ * The way this is enforced is checking whether the vcpu at the top
+ * of the runqueue has negative credits. This means the epochs have
+ * variable lenght, as in one epoch expores when:
+ *  1) the vcpu at the top of the runqueue has executed for
+ * around 10 ms (with default parameters);
+ *  2) no other vcpu with higher credits wants to run.
+ *
+ * Here, where we want to check for reset, we need to make sure the
+ * proper vcpu is being used. In fact, runqueue_candidate() may have
+ * not returned the first vcpu in the runqueue, for various reasons
+ * (e.g., affinity). Only trigger a reset when it does.
+ */
+if ( skipped_vcpus == 0 && snext->credit <= CSCHED2_CREDIT_RESET )
 {
Is there no other way of checking that the returned vcpu from 
runqueue_candidate is the first one apart from using "skipped_vcpu" 
extra variable. Looks a bit ugly.

 reset_credit(ops, cpu, now, snext);
 balance_load(ops, cpu, now);




___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 17/24] xen: credit2: soft-affinity awareness in runq_tickle()

2016-09-07 Thread anshul makkar

On 05/09/16 15:55, Dario Faggioli wrote:

On Thu, 2016-09-01 at 11:52 +0100, anshul makkar wrote:

On 17/08/16 18:19, Dario Faggioli wrote:


+/*
+ * We're doing soft-affinity, and we know that the current
vcpu on cpu
+ * has a soft affinity. We now want to know whether cpu itself
is in

Please can you explain the above statment. If the vcpu has soft
affinity
and its currently executing, doesn;t it always means that its running
on
one of the pcpu which is there in its soft affinity or hard affinity?


A vcpu will always run on a pcpu from its own hard-affinity (that's the
definition of hard-affinity).

On the other hand, a vcpu will, most of the time, run on a cpu from its
own soft affinity, but can run on a cpu that is in its hard-affinity,
but *IS NOT* in its soft-affinity.

That's the definition of soft-affinity: the scheduler will try to run
it there, but it that can't happen, it will run it will run it outside
of it (but still within its hard-affinity, of course).

So, yes, we know already that it's running in a cpu at least from its
hard affinity, what is it exactly that you are not understanding?


If I put it simply ,  can  (X being a vcpu)
x {soft affinity pcpus} Intersect x { hard affinity pcpu} -> be Null or 
disjoint set ?

and

x{runnable pcpu} intersect (x{hard affinity pcpu} union x{soft affinity 
pcpu} ) -> be null or disjoint ??





+ * such affinity. In fact, since we now that new (in
runq_tickle()) is

Typo:   * such affinity. In fact, since now we know that new (in
runq_tickle()) is


Thanks. :-)



+ */



Regards,
Dario


Anshul Makkar

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 18/24] xen: credit2: soft-affinity awareness fallback_cpu() and cpu_pick()

2016-09-07 Thread anshul makkar

On 05/09/16 14:26, Dario Faggioli wrote:

On Thu, 2016-09-01 at 12:08 +0100, anshul makkar wrote:

On 17/08/16 18:19, Dario Faggioli wrote:



Can't we
just read their workload or we can change the locktype to allow
reading ?


Reading without taking the lock would race against the load value being
updated. And updating the load is done by __update_runq_load(), which,
with all it's shifts and maths, by no means is an atomic operation.

So it's not just a matter of risking to read a slightly outdated value,
which, I agree, may not be that bad, it's that we risk reading
something inconsistent. :-/


Ok. Got it and agree.

About "changing the locktype", I guess you mean that we can turn also
the runqueue lock into an rw-lock? If yes, that's indeed interesting,
and I've also thought about it, but, for now, always deferred trying to
actually do that.

Yes.


It's technically non trivial, as it would involve changing
schedule_data->schedule_lock and all the {v,p}cpu_schedule_lock*()
functions. Also, it's a lock that will almost all the times be taken
for writing, which usually means what you want is a proper spinlock.

So, IMO, before embarking in doing something like that, it'd be good to
figure out how frequently we actually fail to take the remote runqueue
lock, and what's the real impact of having to deal the consequence of
that.


Ok. Lets discuss on that to finalize the approach.

I'm not saying it's not worth a try, but I'm saying that's it's
something at high risk of being a lot of work for a very small gain,
and that there are more important things to focus on.


+ * we pick, in order of decreasing preference:
+ *  1) svc's current pcpu, if it is part of svc's soft affinity;
+ *  2) a pcpu in svc's current runqueue that is also in svc's soft
affinity;

svc's current runqueue. Do you mean the runq in which svc is
currently
queued ?


I mean the runqueue to which svc is currently assigned (again, as per
runq_assign()/runq_deassing()), which in turns mean that, if svc is
queued in a runqueue, it's queues there (so, I guess the short answer
to your question is "yes" :-D).


Ok.

Regards,
Dario


Anshul


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 05/24] xen: credit2: make tickling more deterministic

2016-09-07 Thread anshul makkar

On 05/09/16 14:47, Dario Faggioli wrote:

On Wed, 2016-08-31 at 18:10 +0100, anshul makkar wrote:

On 17/08/16 18:18, Dario Faggioli wrote:


@@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler
*ops, struct vcpu *vc, void *dd)
   ASSERT(svc->sdom != NULL);
   svc->credit = CSCHED2_CREDIT_INIT;
   svc->weight = svc->sdom->weight;
+svc->tickled_cpu = -1;
   /* Starting load of 50% */
   svc->avgload = 1ULL << (CSCHED2_PRIV(ops)-

load_precision_shift - 1);

   svc->load_last_update = NOW() >>
LOADAVG_GRANULARITY_SHIFT;
@@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler
*ops, struct vcpu *vc, void *dd)
   else
   {
   ASSERT(svc->sdom == NULL);
+svc->tickled_cpu = svc->vcpu->vcpu_id;

If I understood correctly, tickled_cpu refers to pcpu and not a
vcpu.
Saving vcpu_id in tickled_cpu looks wrong.


Yes, and in fact, as you can see in the previous hunk, for pretty much
all vcpus, tickled_cpu is initialized to -1.

Here, we are dealing with the vcpus of the idle domain. And for vcpus
of the idle domain, their vcpu id is the same as the id of the pcpu
they're associated to.

Ah, that makes it clear .


I agree it looks a little bit weird, but it's both correct, and the
easiest and cleanest way for initializing this.

I guess I can add a comment...

That will be useful.


Thanks and Regards,
Dario


Thanks
Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 23/24] xen: credit2: optimize runq_tickle() a little bit

2016-09-02 Thread anshul makkar

On 17/08/16 18:20, Dario Faggioli wrote:

By not looking at the same cpu (to check whether
we want to preempt who's running there) twice, if
the vcpu being woken up has both soft and hard
affinity.

In fact, all the cpus that are part of both soft
affinity and hard-affinity (of the waking vcpu)
are checked during the soft-affinity balancing
step. If none turns out to be suitable, e.g.,
because they're running vcpus with higher credits,
there's no point in re-checking them, only to
re-assess the same, during the hard-affinity
step.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
---
Cc: George Dunlap <george.dun...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
---
  xen/common/sched_credit2.c |   43 +++
  1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 6963872..f03ecce 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -997,7 +997,7 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  s_time_t lowest = (1<<30);
  unsigned int bs, cpu = new->vcpu->processor;
  struct csched2_runqueue_data *rqd = RQD(ops, cpu);
-cpumask_t mask;
+cpumask_t mask, skip_mask;
  struct csched2_vcpu * cur;

  ASSERT(new->rqd == rqd);
@@ -1017,6 +1017,13 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  (unsigned char *));
  }

+/*
+ * Cpus that end up in this mask, have been already checked during the
+ * soft-affinity step, and need not to be checked again when doing hard
+ * affinity.
+ */
+cpumask_clear(_mask);
+
  for_each_affinity_balance_step( bs )
  {
  /*
@@ -1073,7 +1080,8 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  cpumask_andnot(, >active, >idle);
  cpumask_andnot(, , >tickled);
  cpumask_and(, , cpumask_scratch);
-if ( cpumask_test_cpu(cpu, ) )
+if ( cpumask_test_cpu(cpu, ) &&
+ !cpumask_test_cpu(cpu, _mask) )
  {
  cur = CSCHED2_VCPU(curr_on_cpu(cpu));

@@ -1102,13 +1110,26 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  ipid = cpu;
  goto tickle;
  }
+
+/*
+ * If we're here, cpu is just not a valid candidate for being
+ * tickled. Set its bit in skip_mask, to avoid calling
+ * burn_credits() and check its current vcpu for preemption
+ * twice.
+ */
+__cpumask_set_cpu(cpu, _mask);
  }
  }

  for_each_cpu(i, )
  {
-/* Already looked at this one above */
-if ( i == cpu )
+/*
+ * Already looked at these ones above, either because it's the
+ * cpu where new was running before, or because we are at the
+ * hard-affinity step, and we checked this during the
+ * soft-affinity one
+ */
Sorry for my naiveness here, but can we be sure that situation has not 
changed since we checked during soft-affinity step. ?

+if ( i == cpu || cpumask_test_cpu(i, _mask) )
  continue;

  cur = CSCHED2_VCPU(curr_on_cpu(i));
@@ -1139,6 +1160,20 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  ipid = i;
  lowest = cur->credit;
  }
+
+/*
+ * No matter if i is the new lowest or not. We've run
+ * burn_credits() on it, and we've checked it for preemption.
+ *
+ * If we are at soft-affinity balancing step, and i is indeed
+ * the lowest, it will be tickled (and we exit the function).
+ * If it is not the lowest among the cpus in the soft-affinity
+ * mask, it can't be the lowest among the cpus in the hard
+ * affinity mask (assuming we'll actually do the second
+ * balancing step), as hard-affinity is a superset of soft
+ * affinity, and therefore we can flag it to be skipped.
+ */
+__cpumask_set_cpu(i, _mask);
  }
  }





___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 19/24] xen: credit2: soft-affinity awareness in load balancing

2016-09-02 Thread anshul makkar

On 17/08/16 18:19, Dario Faggioli wrote:

We want is soft-affinity to play a role in load
balancing, i.e., when deciding whether or not to



something like that at some point.

(Oh, and while there, just a couple of style fixes
are also done.)

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
---
Cc: George Dunlap <george.dun...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
---
  xen/common/sched_credit2.c |  359 
  1 file changed, 326 insertions(+), 33 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 2d7228a..3722f46 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1786,19 +1786,21 @@ csched2_cpu_pick(const struct scheduler *ops, struct 
vcpu *vc)
  return new_cpu;
  }

-/* Working state of the load-balancing algorithm */
+/* Working state of the load-balancing algorithm. */
  typedef struct {
-/* NB: Modified by consider() */
+/* NB: Modified by consider(). */
  s_time_t load_delta;
  struct csched2_vcpu * best_push_svc, *best_pull_svc;
-/* NB: Read by consider() */
+/* NB: Read by consider() (and the various consider_foo() functions). */
  struct csched2_runqueue_data *lrqd;
-struct csched2_runqueue_data *orqd;
+struct csched2_runqueue_data *orqd;
+bool_t push_has_soft_aff, pull_has_soft_aff;
+s_time_t push_soft_aff_load, pull_soft_aff_load;
  } balance_state_t;

-static void consider(balance_state_t *st,
- struct csched2_vcpu *push_svc,
- struct csched2_vcpu *pull_svc)
+static inline s_time_t consider_load(balance_state_t *st,
+ struct csched2_vcpu *push_svc,
+ struct csched2_vcpu *pull_svc)
  {
  s_time_t l_load, o_load, delta;

@@ -1821,11 +1823,166 @@ static void consider(balance_state_t *st,
  if ( delta < 0 )
  delta = -delta;

+return delta;
+}
+
+/*
+ * Load balancing and soft-affinity.
+ *
+ * When trying to figure out whether or not it's best to move a vcpu from
+ * one runqueue to another, we must keep soft-affinity in mind. Intuitively
+ * we would want to know the following:
+ *  - 'how much' affinity does the vcpu have with its current runq?
+ *  - 'how much' affinity will it have with its new runq?
+ *
+ * But we certainly need to be more precise about how much it is that 'how
+ * much'! Let's start with some definitions:
+ *
+ *  - let v be a vcpu, running in runq I, with soft-affinity to vi
+ *pcpus of runq I, and soft affinity with vj pcpus of runq J;
+ *  - let k be another vcpu, running in runq J, with soft-affinity to kj
+ *pcpus of runq J, and with ki pcpus of runq I;
+ *  - let runq I have Ci pcpus, and runq J Cj pcpus;
+ *  - let vcpu v have an average load of lv, and k an average load of lk;
+ *  - let runq I have an average load of Li, and J an average load of Lj.
+ *
+ * We also define the following::
+ *
+ *  - lvi = lv * (vi / Ci)  as the 'perceived load' of v, when running
+ *  in runq i;
+ *  - lvj = lv * (vj / Cj)  as the 'perceived load' of v, it running
+ *  in runq j;
+ *  - the same for k, mutatis mutandis.
+ *
+ * Idea is that vi/Ci (i.e., the ratio of the number of cpus of a runq that
+ * a vcpu has soft-affinity with, over the total number of cpus of the runq
+ * itself) can be seen as the 'degree of soft-affinity' of v to runq I (and
+ * vj/Cj the one of v to J). In other words, we define the degree of soft
+ * affinity of a vcpu to a runq as what fraction of pcpus of the runq itself
+ * the vcpu has soft-affinity with. Then, we multiply this 'degree of
+ * soft-affinity' by the vcpu load, and call the result the 'perceived load'.
+ *
+ * Basically, if a soft-affinity is defined, the work done by a vcpu on a
+ * runq to which it has higher degree of soft-affinity, is considered
+ * 'lighter' than the same work done by the same vcpu on a runq to which it
+ * has smaller degree of soft-affinity (degree of soft affinity is <= 1). In
+ * fact, if soft-affinity is used to achieve NUMA-aware scheduling, the higher
+ * the degree of soft-affinity of the vcpu to a runq, the greater the 
probability
+ * of accessing local memory, when running on such runq. And that is certainly\
+ * 'lighter' than having to fetch memory from remote NUMA nodes.
Do we ensure that while defining soft-affinity for a vcpu, NUMA 
architecture is considered. If not, then this whole calculation can go 
wrong and have negative impact on performance.


Degree of affinity to runq will give good result if the affinity to 
pcpus has been chosen after due consideration ..

+ *
+ * SoXX, evaluating pushing v from I to J would mean removing (from I) a
+ * perceived load of lv*(vi/Ci) and adding (to J) a perceived load of
+ * lv*(vj/Cj), which we (looking at things from the point of view of I,
+ * which is what balan

Re: [Xen-devel] [PATCH 18/24] xen: credit2: soft-affinity awareness fallback_cpu() and cpu_pick()

2016-09-01 Thread anshul makkar

On 17/08/16 18:19, Dario Faggioli wrote:

For get_fallback_cpu(), by putting in place the "usual"
two steps (soft affinity step and hard affinity step)
loop. We just move the core logic of the function inside
the body of the loop itself.

For csched2_cpu_pick(), what is important is to find
the runqueue with the least average load. Currently,
we do that by looping on all runqueues and checking,
well, their load. For soft affinity, we want to know
which one is the runqueue with the least load, among
the ones where the vcpu would prefer to be assigned.

We find both the least loaded runqueue among the soft
affinity "friendly" ones, and the overall least loaded
one, in the same pass.

(Also, kill a spurious ';' when defining MAX_LOAD.)

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
Signed-off-by: Justin T. Weaver <jtwea...@hawaii.edu>
---
Cc: George Dunlap <george.dun...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
---
  xen/common/sched_credit2.c |  136 
  1 file changed, 111 insertions(+), 25 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 3aef1b4..2d7228a 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -506,34 +506,68 @@ void smt_idle_mask_clear(unsigned int cpu, cpumask_t 
*mask)
  }

  /*
- * When a hard affinity change occurs, we may not be able to check some
- * (any!) of the other runqueues, when looking for the best new processor
- * for svc (as trylock-s in csched2_cpu_pick() can fail). If that happens, we
- * pick, in order of decreasing preference:
- *  - svc's current pcpu;
- *  - another pcpu from svc's current runq;
- *  - any cpu.
+ * In csched2_cpu_pick(), it may not be possible to actually look at remote
+ * runqueues (the trylock-s on their spinlocks can fail!). If that happens,
With remote runqueues , do you mean runqs on remote socket ? Can't we 
just read their workload or we can change the locktype to allow reading ?

+ * we pick, in order of decreasing preference:
+ *  1) svc's current pcpu, if it is part of svc's soft affinity;
+ *  2) a pcpu in svc's current runqueue that is also in svc's soft affinity;
svc's current runqueue. Do you mean the runq in which svc is currently 
queued ?

+ *  3) just one valid pcpu from svc's soft affinity;
+ *  4) svc's current pcpu, if it is part of svc's hard affinity;
+ *  5) a pcpu in svc's current runqueue that is also in svc's hard affinity;
+ *  6) just one valid pcpu from svc's hard affinity
+ *
+ * Of course, 1, 2 and 3 makes sense only if svc has a soft affinity. Also
+ * note that at least 6 is guaranteed to _always_ return at least one pcpu.
   */
  static int get_fallback_cpu(struct csched2_vcpu *svc)
  {
  int cpu;
+unsigned int bs;

-if ( likely(cpumask_test_cpu(svc->vcpu->processor,
- svc->vcpu->cpu_hard_affinity)) )
-return svc->vcpu->processor;
+for_each_affinity_balance_step( bs )
+{
+if ( bs == BALANCE_SOFT_AFFINITY &&
+ !has_soft_affinity(svc->vcpu, svc->vcpu->cpu_hard_affinity) )
+continue;



Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 17/24] xen: credit2: soft-affinity awareness in runq_tickle()

2016-09-01 Thread anshul makkar

On 17/08/16 18:19, Dario Faggioli wrote:

This is done by means of the "usual" two steps loop:
  - soft affinity balance step;
  - hard affinity balance step.

The entire logic implemented in runq_tickle() is
applied, during the first step, considering only the
CPUs in the vcpu's soft affinity. In the second step,
we fall back to use all the CPUs from its hard
affinity (as it is doing now, without this patch).

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
Signed-off-by: Justin T. Weaver <jtwea...@hawaii.edu>
---
Cc: George Dunlap <george.dun...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
---
  xen/common/sched_credit2.c |  243 
  1 file changed, 157 insertions(+), 86 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 0d83bd7..3aef1b4 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -902,6 +902,42 @@ __runq_remove(struct csched2_vcpu *svc)
  list_del_init(>runq_elem);
  }

+/*
+ * During the soft-affinity step, only actually preempt someone if
+ * he does not have soft-affinity with cpu (while we have).
+ *
+ * BEWARE that this uses cpumask_scratch, trowing away what's in there!

Typo:* BEWARE that this uses cpumask_scratch, throwing away what's in there!


+ */
+static inline bool_t soft_aff_check_preempt(unsigned int bs, unsigned int cpu)
+{
+struct csched2_vcpu * cur = CSCHED2_VCPU(curr_on_cpu(cpu));
+
+/*
+ * If we're doing hard-affinity, always check whether to preempt cur.
+ * If we're doing soft-affinity, but cur doesn't have one, check as well.
+ */
+if ( bs == BALANCE_HARD_AFFINITY ||
+ !has_soft_affinity(cur->vcpu, cur->vcpu->cpu_hard_affinity) )
+return 1;
+
+/*
+ * We're doing soft-affinity, and we know that the current vcpu on cpu
+ * has a soft affinity. We now want to know whether cpu itself is in
Please can you explain the above statment. If the vcpu has soft affinity 
and its currently executing, doesn;t it always means that its running on 
one of the pcpu which is there in its soft affinity or hard affinity?

+ * such affinity. In fact, since we now that new (in runq_tickle()) is
Typo:   * such affinity. In fact, since now we know that new (in 
runq_tickle()) is

+ *  - if cpu is not in cur's soft-affinity, we should indeed check to
+ *see whether new should preempt cur. If that will be the case, that
+ *would be an improvement wrt respecting soft affinity;
+ *  - if cpu is in cur's soft-affinity, we leave it alone and (in
+ *runq_tickle()) move on to another cpu. In fact, we don't want to
+ *be too harsh with someone which is running within its soft-affinity.
+ *This is safe because later, if we don't fine anyone else during the
+ *soft-affinity step, we will check cpu for preemption anyway, when
+ *doing hard-affinity.
+ */
+affinity_balance_cpumask(cur->vcpu, BALANCE_SOFT_AFFINITY, 
cpumask_scratch);
+return !cpumask_test_cpu(cpu, cpumask_scratch);
+}
+
  void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *, 
s_time_t);

  /*
@@ -925,7 +961,7 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  {
  int i, ipid = -1;
  s_time_t lowest = (1<<30);
-unsigned int cpu = new->vcpu->processor;
+unsigned int bs, cpu = new->vcpu->processor;
  struct csched2_runqueue_data *rqd = RQD(ops, cpu);
  cpumask_t mask;
  struct csched2_vcpu * cur;
@@ -947,109 +983,144 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  (unsigned char *));
  }

-/*
- * First of all, consider idle cpus, checking if we can just
- * re-use the pcpu where we were running before.
- *
- * If there are cores where all the siblings are idle, consider
- * them first, honoring whatever the spreading-vs-consolidation
- * SMT policy wants us to do.
- */
-if ( unlikely(sched_smt_power_savings) )
-cpumask_andnot(, >idle, >smt_idle);
-else
-cpumask_copy(, >smt_idle);
-cpumask_and(, , new->vcpu->cpu_hard_affinity);
-i = cpumask_test_or_cycle(cpu, );
-if ( i < nr_cpu_ids )
+for_each_affinity_balance_step( bs )
  {
-SCHED_STAT_CRANK(tickled_idle_cpu);
-ipid = i;
-goto tickle;
-}
+/*
+ * First things first: if we are at the first (soft affinity) step,
+ * but new doesn't have a soft affinity, skip this step.
+ */
+if ( bs == BALANCE_SOFT_AFFINITY &&
+ !has_soft_affinity(new->vcpu, new->vcpu->cpu_hard_affinity) )
+continue;

-/*
- * If there are no fully idle cores, check all idlers, after
- * having filtered out pcpus that have been tickled but haven't
-  

Re: [Xen-devel] [PATCH 05/24] xen: credit2: make tickling more deterministic

2016-08-31 Thread anshul makkar

On 17/08/16 18:18, Dario Faggioli wrote:

Right now, the following scenario can occurr:
  - upon vcpu v wakeup, v itself is put in the runqueue,
and pcpu X is tickled;
  - pcpu Y schedules (for whatever reason), sees v in
the runqueue and picks it up.

This may seem ok (or even a good thing), but it's not.
In fact, if runq_tickle() decided X is where v should
run, it did it for a reason (load distribution, SMT
support, cache hotness, affinity, etc), and we really
should try as hard as possible to stick to that.

Of course, we can't be too strict, or we risk leaving
vcpus in the runqueue while there is available CPU
capacity. So, we only leave v in runqueue --for X to
pick it up-- if we see that X has been tickled and
has not scheduled yet, i.e., it will have a real chance
of actually select and schedule v.

If that is not the case, we schedule it on Y (or, at
least, we consider that), as running somewhere non-ideal
is better than not running at all.

The commit also adds performance counters for each of
the possible situations.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>
---
Cc: George Dunlap <george.dun...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
Cc: Jan Beulich <jbeul...@suse.com>
Cc: Andrew Cooper <andrew.coop...@citrix.com>
---
  xen/common/sched_credit2.c   |   65 +++---
  xen/include/xen/perfc_defn.h |3 ++
  2 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 12dfd20..a3d7beb 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -54,6 +54,7 @@
  #define TRC_CSCHED2_LOAD_CHECK   TRC_SCHED_CLASS_EVT(CSCHED2, 16)
  #define TRC_CSCHED2_LOAD_BALANCE TRC_SCHED_CLASS_EVT(CSCHED2, 17)
  #define TRC_CSCHED2_PICKED_CPU   TRC_SCHED_CLASS_EVT(CSCHED2, 19)
+#define TRC_CSCHED2_RUNQ_CANDIDATE   TRC_SCHED_CLASS_EVT(CSCHED2, 20)

  /*
   * WARNING: This is still in an experimental phase.  Status and work can be 
found at the
@@ -398,6 +399,7 @@ struct csched2_vcpu {
  int credit;
  s_time_t start_time; /* When we were scheduled (used for credit) */
  unsigned flags;  /* 16 bits doesn't seem to play well with 
clear_bit() */
+int tickled_cpu; /* cpu tickled for picking us up (-1 if none) */

  /* Individual contribution to load */
  s_time_t load_last_update;  /* Last time average was updated */
@@ -1049,6 +1051,10 @@ runq_tickle(const struct scheduler *ops, struct 
csched2_vcpu *new, s_time_t now)
  __cpumask_set_cpu(ipid, >tickled);
  smt_idle_mask_clear(ipid, >smt_idle);
  cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
+
+if ( unlikely(new->tickled_cpu != -1) )
+SCHED_STAT_CRANK(tickled_cpu_overwritten);
+new->tickled_cpu = ipid;
  }

  /*
@@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct 
vcpu *vc, void *dd)
  ASSERT(svc->sdom != NULL);
  svc->credit = CSCHED2_CREDIT_INIT;
  svc->weight = svc->sdom->weight;
+svc->tickled_cpu = -1;
  /* Starting load of 50% */
  svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1);
  svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT;
@@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct 
vcpu *vc, void *dd)
  else
  {
  ASSERT(svc->sdom == NULL);
+svc->tickled_cpu = svc->vcpu->vcpu_id;
If I understood correctly, tickled_cpu refers to pcpu and not a vcpu. 
Saving vcpu_id in tickled_cpu looks wrong.



  svc->credit = CSCHED2_IDLE_CREDIT;
  svc->weight = 0;
  }
@@ -2233,7 +2241,8 @@ void __dump_execstate(void *unused);
  static struct csched2_vcpu *
  runq_candidate(struct csched2_runqueue_data *rqd,
 struct csched2_vcpu *scurr,
-   int cpu, s_time_t now)
+   int cpu, s_time_t now,
+   unsigned int *pos)
  {
  struct list_head *iter;
  struct csched2_vcpu *snext = NULL;
@@ -2262,13 +2271,29 @@ runq_candidate(struct csched2_runqueue_data *rqd,

  /* Only consider vcpus that are allowed to run on this processor. */
  if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
+{
+(*pos)++;
  continue;
+}
+
+/*
+ * If a vcpu is meant to be picked up by another processor, and such
+ * processor has not scheduled yet, leave it in the runqueue for him.
+ */
+if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu &&
+ cpumask_test_cpu(svc->tickled_cpu, >tickled) )
+{
+(*pos)++;
+SCHED_STAT_CRANK(deferred_to_tickled_cpu);
+continue;
+}

  /* If this is on a different processor, don't pull it unless
   

Re: [Xen-devel] [PATCH 4/4] tools/xenalyze: Allow automatic resizing of sample buffers

2016-08-08 Thread anshul makkar

On 08/08/16 10:54, George Dunlap wrote:

Rather than have large fixed-size buffers, start with smaller buffers
and allow them to grow as needed (doubling each time), with a fairly
large maximum.  Allow this maximum to be set by a command-line
parameter.

Signed-off-by: George Dunlap <george.dun...@citrix.com>
---
CC: Ian Jackson <ian.jack...@citrix.com>
CC: Wei Liu <wei.l...@citrix.com>
CC: Dario Faggioli <dario.faggi...@citrix.com>
CC: Anshul Makkar <anshul.mak...@citrix.com>
---
  tools/xentrace/xenalyze.c | 95 +--
  1 file changed, 68 insertions(+), 27 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 455cbdf..a4d8823 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -44,7 +44,8 @@ struct mread_ctrl;
  #define QHZ_FROM_HZ(_hz) (((_hz) << 10)/ 10)

  #define ADDR_SPACE_BITS 48
-#define DEFAULT_SAMPLE_SIZE 10240
+#define DEFAULT_SAMPLE_SIZE 1024
+#define DEFAULT_SAMPLE_MAX  1024*1024*32
  #define DEFAULT_INTERVAL_LENGTH 1000




  s->event_count++;

  if (!c)
  return;

  if(opt.sample_size) {
-int lap = (s->count/opt.sample_size)+1,
-index =s->count % opt.sample_size;
-if((index - (lap/3))%lap == 0) {
-if(!s->sample) {
-s->sample = malloc(sizeof(*s->sample) * opt.sample_size);
-if(!s->sample) {
-fprintf(stderr, "%s: malloc failed!\n", __func__);
-error(ERR_SYSTEM, NULL);
-}
+if (s->count >= s->sample_size
+&& (s->count == 0
+|| opt.sample_max == 0
+|| s->sample_size < opt.sample_max)) {
+int new_size;
+void * new_sample = NULL;
+
+new_size = s->sample_size << 1;
Sorry for my ignorance here, but why we have chosen to double the size. 
Can't we increase by fixed size X where X < double size. Are we sure 
that we will be able to fully utilize the double sized buffer.

+
+if (new_size == 0)
+new_size = opt.sample_size;
+
+if (opt.sample_max != 0 && new_size > opt.sample_max)
+new_size = opt.sample_max;
+
+//printf("New size: %d\n", new_size);
+
+new_sample = realloc(s->sample, sizeof(*s->sample) * new_size);
+
+if (new_sample) {
+s->sample = new_sample;
+s->sample_size = new_size;
  }
-s->sample[index]=c;
  }
+
+if (s->count < s->sample_size) {
+s->sample[s->count]=c;
+} else {
+/*
+ * If we run out of space for samples, start taking only a
+ * subset of samples.
+ */
+int lap, index;
+lap = (s->count/s->sample_size)+1;
+index =s->count % s->sample_size;
+if((index - (lap/3))%lap == 0) {
+s->sample[index]=c;
+ }
+ }
  }
  s->count++;







___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH -v3 1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue o

2016-08-03 Thread anshul makkar

On 03/08/16 11:16, George Dunlap wrote:

On 26/07/16 17:21, Dario Faggioli wrote:

On Mon, 2016-07-25 at 17:12 +0100, Anshul Makkar wrote:

It introduces context-switch rate-limiting. The patch enables the VM

The subject, which will become the "title" of the commit, is way too
long. That must be a very concise headline of what the patch is about,
and should also have tags, specifying what areas of the codebase are
affected. So what do you think of this:

   xen: credit2: implement context switch rate-limiting.


It looks like it's just missing a carrage return -- I could fix that up
on check-in.



On a more technical side, I think that...

+if ( prv->ratelimit_us )
+{
+s_time_t ratelimit_min = prv->ratelimit_us;


... this should be:

  s_time_t ratelimit_min = MICROSECS(prv->ratelimit_us);

I realized that by looking at traces and seeing entries for which
CSCHED2_MIN_TIMER was being returned, even if I had set
sched_ratelimit_us to a value greater than that.


Yes, Dario is correct here.

There's also a small typo in one of the comments ("onw" instead of "own").

With all those changed:

Reviewed-by: George Dunlap <george.dun...@citrix.com>

If Anshul and Dario are happy, I can fix all those thing up on commit.


Fine with me.

-George


Anshul


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH -v3 1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue or be

2016-07-25 Thread Anshul Makkar
It introduces context-switch rate-limiting. The patch enables the VM to batch
its work and prevents the system from spending most of its time in context
switches because of a VM that is waking/sleeping at high rate.

ratelimit can be disabled by setting it to 0.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
 Changes in v3:
 * General fixes based on the review comments from v1 and v2.
---
 xen/common/sched_credit2.c | 111 ++---
 1 file changed, 85 insertions(+), 26 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index b92226c..d9f39dc 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -377,6 +377,7 @@ struct csched2_private {
 
 unsigned int load_precision_shift;
 unsigned int load_window_shift;
+unsigned ratelimit_us; /* each cpupool can have its onw ratelimit */
 };
 
 /*
@@ -2029,6 +2030,34 @@ csched2_dom_cntl(
 return rc;
 }
 
+static int csched2_sys_cntl(const struct scheduler *ops,
+struct xen_sysctl_scheduler_op *sc)
+{
+int rc = -EINVAL;
+xen_sysctl_credit_schedule_t *params = >u.sched_credit;
+struct csched2_private *prv = CSCHED2_PRIV(ops);
+unsigned long flags;
+
+switch (sc->cmd )
+{
+case XEN_SYSCTL_SCHEDOP_putinfo:
+if ( params->ratelimit_us &&
+( params->ratelimit_us > XEN_SYSCTL_SCHED_RATELIMIT_MAX ||
+  params->ratelimit_us < XEN_SYSCTL_SCHED_RATELIMIT_MIN ))
+return rc;
+write_lock_irqsave(>lock, flags);
+prv->ratelimit_us = params->ratelimit_us;
+write_unlock_irqrestore(>lock, flags);
+break;
+
+case XEN_SYSCTL_SCHEDOP_getinfo:
+params->ratelimit_us = prv->ratelimit_us;
+rc = 0;
+break;
+}
+return rc;
+}
+
 static void *
 csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
 {
@@ -2098,12 +2127,14 @@ csched2_dom_destroy(const struct scheduler *ops, struct 
domain *dom)
 
 /* How long should we let this vcpu run for? */
 static s_time_t
-csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu 
*snext)
+csched2_runtime(const struct scheduler *ops, int cpu,
+struct csched2_vcpu *snext, s_time_t now)
 {
-s_time_t time; 
+s_time_t time, min_time;
 int rt_credit; /* Proposed runtime measured in credits */
 struct csched2_runqueue_data *rqd = RQD(ops, cpu);
 struct list_head *runq = >runq;
+struct csched2_private *prv = CSCHED2_PRIV(ops);
 
 /*
  * If we're idle, just stay so. Others (or external events)
@@ -2116,9 +2147,22 @@ csched2_runtime(const struct scheduler *ops, int cpu, 
struct csched2_vcpu *snext
  * 1) Run until snext's credit will be 0
  * 2) But if someone is waiting, run until snext's credit is equal
  * to his
- * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER.
+ * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER or
+ * the ratelimit time.
  */
 
+/* Calculate mintime */
+min_time = CSCHED2_MIN_TIMER;
+if ( prv->ratelimit_us )
+{
+s_time_t ratelimit_min = prv->ratelimit_us;
+if ( snext->vcpu->is_running )
+ratelimit_min = snext->vcpu->runstate.state_entry_time +
+MICROSECS(prv->ratelimit_us) - now;
+if ( ratelimit_min > min_time )
+min_time = ratelimit_min;
+}
+
 /* 1) Basic time: Run until credit is 0. */
 rt_credit = snext->credit;
 
@@ -2135,32 +2179,32 @@ csched2_runtime(const struct scheduler *ops, int cpu, 
struct csched2_vcpu *snext
 }
 }
 
-/* The next guy may actually have a higher credit, if we've tried to
- * avoid migrating him from a different cpu.  DTRT.  */
-if ( rt_credit <= 0 )
+/*
+ * The next guy on the runqueue may actually have a higher credit,
+ * if we've tried to avoid migrating him from a different cpu.
+ * Setting time=0 will ensure the minimum timeslice is chosen.
+ *
+ * FIXME: See if we can eliminate this conversion if we know time
+ * will be outside (MIN,MAX).  Probably requires pre-calculating
+ * credit values of MIN,MAX per vcpu, since each vcpu burns credit
+ * at a different rate.
+ */
+if (rt_credit > 0)
+time = c2t(rqd, rt_credit, snext);
+else
+time = 0;
+
+/* 3) But never run longer than MAX_TIMER or less than MIN_TIMER or
+ * the rate_limit time. */
+if ( time < min_time)
 {
-time = CSCHED2_MIN_TIMER;
+time = min_time;
 SCHED_STAT_CRANK(runtime_min_timer);
 }
-else
+else if (time > CSCHED2_MAX_TIMER)
 {
-/* FIXME: See if we can eliminate this conversion if we know time
- * will be outside (MIN,MAX

Re: [Xen-devel] [PATCH v2 1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue or

2016-07-25 Thread anshul makkar

On 22/07/16 11:36, Dario Faggioli wrote:

On Mon, 2016-07-18 at 13:22 +0100, Anshul Makkar wrote:



Hey, Anshul.

Thanks, and sorry for the delay in reviewing.

This version is an improvement, but it looks to me that you've missed a
few of the review comments to v1.

Sorry about that. !!

It introduces a minimum amount of latency


"introduces context-switch rate-limiting"

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 8b95a47..68bcdb8 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c

@@ -1601,6 +1602,34 @@ csched2_dom_cntl(
+switch (sc->cmd )
+{
+case XEN_SYSCTL_SCHEDOP_putinfo:
+if ( params->ratelimit_us &&
+( params->ratelimit_us < CSCHED2_MIN_TIMER ||
+  params->ratelimit_us >

I remember saying already (although, it may have be in pvt, not on this
list) that I think we should just use XEN_SYSCTL_SCHED_RATELIMIT_MAX
and XEN_SYSCTL_SCHED_RATELIMIT_MIN here.

CSCHED2_MIN_TIMER and CSCHED2_MAX_TIMER are internal implementation
details, and I don't like them exposed (although, indirectly) to the
user.

addressed.

+return rc;
+spin_lock_irqsave(>lock, flags);
+

This is ok. However, the code base changed in the meanwhile (sorry! :-
P), and this spin_lock_irqsave() needs to become a
write_lock_irqsave().

done.


Mmm... if you wanted to implement my suggestion from
<1468400021.13039.33.ca...@citrix.com>, you're definitely missing
something:

  s_time_t ratelimit_min = prv->ratelimit_us;
  if ( snext->vcpu->is_running )
  ratelimit_min = snext->vcpu->runstate.state_entry_time +
  MICROSECS(prv->ratelimit_us) - now;


yes, missed the if part for checking if the vcpu is currently running.

In fact, you're initializing ratelimit_min and then immediately
overriding that... I'm surprised the compiler didn't complain.


+if ( ratelimit_min > min_time )
+min_time = ratelimit_min;
+}
+



@@ -1707,32 +1749,33 @@ csched2_runtime(const struct scheduler *ops,
int cpu, struct csched2_vcpu *snext
  }
  }




@@ -1746,7 +1789,7 @@ void __dump_execstate(void *unused);
  static struct csched2_vcpu *
  runq_candidate(struct csched2_runqueue_data *rqd,
 struct csched2_vcpu *scurr,
-   int cpu, s_time_t now)
+   int cpu, s_time_t now, struct csched2_private *prv)


Reviewing v1, George said this:

   Since we have the cpu, I think we can get ops this way, without
   cluttering things up with the extra argument:

   struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));

yes, missed that change too. Addressed in v3.



@@ -1775,9 +1829,13 @@ runq_candidate(struct csched2_runqueue_data
*rqd,
  }

  /* If the next one on the list has more credit than current
- * (or idle, if current is not runnable), choose it. */
+ * (or idle, if current is not runnable) and current one has
already
+ * executed for more than ratelimit. choose it.
+ * Control has reached here means that current vcpu has
executed >
+ * ratelimit_us or ratelimit is off, so chose the next one.
+ */
  if ( svc->credit > snext->credit )
-snext = svc;
+snext = svc;


Both me and George agreed that changing the comment like this is not
helping much and should not be done.
Though, I find the extended comment useful, but if you don't agree I 
will remove it v3.




Regards,
Dario


Anshul



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] XSM-docs: Flask operates on domain types and not on individual domain. Updated the documentation to reflect this.

2016-07-19 Thread anshul makkar

On 19/07/16 14:30, Doug Goldstein wrote:

On 7/19/16 4:05 AM, Anshul Makkar wrote:

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
  * Resending the patch due to incomplete subject in the previous patch.

  docs/misc/xsm-flask.txt | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)
---
diff --git a/docs/misc/xsm-flask.txt b/docs/misc/xsm-flask.txt
index 62f15dd..bf8bb6e 100644
  Some examples of what FLASK can do:
- - Prevent two domains from communicating via event channels or grants
- - Control which domains can use device passthrough (and which devices)
+ - Prevent two domains types from communicating via event channels or grants
+ - Control which type of domains can use device passthrough (and which devices)


I disagree with this snippet. This is an example of what you can do with
FLASK. You can use flask to do those two actions. Adding the word
"types" in there takes it from being a concrete example to being more
ambiguous.
"Prevent domains belonging to different types to communicate via event 
channels or grants". Does this sounds better.


I think that its important to use the word "type" so that user doesn't 
get a wrong impression that the policy is per domain, while in actual 
its per type.



   - Restrict or audit operations performed by privileged domains
   - Prevent a privileged domain from arbitrarily mapping pages from other 
domains

@@ -160,10 +160,10 @@ the policy can be reloaded using "xl loadpolicy".
  The example policy included with Xen demonstrates most of the features of 
FLASK
  that can be used without dom0 disaggregation. The main types for domUs are:

- - domU_t is a domain that can communicate with any other domU_t
+ - domU_t is a domain type that can communicate with any other domU_t types.


"A domain labeled with domU_t can communicate with any other domain
labeled with type domU_t."

Rephrasing is fine.



   - isolated_domU_t can only communicate with dom0
   - prot_domU_t is a domain type whose creation can be disabled with a boolean
- - nomigrate_t is a domain that must be created via the nomigrate_t_building
+ - nomigrate_t is a domain type that must be created via the 
nomigrate_t_building
 type, and whose memory cannot be read by dom0 once created


"A domain labeled with nomigeate_t is a domain that"

Rephrasing is fine.




  HVM domains with stubdomain device models also need a type for the stub 
domain.







___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] XSM-docs: Flask operates on domain types and not on individual domain. Updated the documentation to reflect this.

2016-07-19 Thread Anshul Makkar
Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
 * Resending the patch due to incomplete subject in the previous patch.

 docs/misc/xsm-flask.txt | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
---
diff --git a/docs/misc/xsm-flask.txt b/docs/misc/xsm-flask.txt
index 62f15dd..bf8bb6e 100644
--- a/docs/misc/xsm-flask.txt
+++ b/docs/misc/xsm-flask.txt
@@ -9,8 +9,8 @@ controls over Xen domains, allowing the policy writer to define 
what
 interactions between domains, devices, and the hypervisor are permitted.
 
 Some examples of what FLASK can do:
- - Prevent two domains from communicating via event channels or grants
- - Control which domains can use device passthrough (and which devices)
+ - Prevent two domains types from communicating via event channels or grants
+ - Control which type of domains can use device passthrough (and which devices)
  - Restrict or audit operations performed by privileged domains
  - Prevent a privileged domain from arbitrarily mapping pages from other 
domains
 
@@ -160,10 +160,10 @@ the policy can be reloaded using "xl loadpolicy".
 The example policy included with Xen demonstrates most of the features of FLASK
 that can be used without dom0 disaggregation. The main types for domUs are:
 
- - domU_t is a domain that can communicate with any other domU_t
+ - domU_t is a domain type that can communicate with any other domU_t types.
  - isolated_domU_t can only communicate with dom0
  - prot_domU_t is a domain type whose creation can be disabled with a boolean
- - nomigrate_t is a domain that must be created via the nomigrate_t_building
+ - nomigrate_t is a domain type that must be created via the 
nomigrate_t_building
type, and whose memory cannot be read by dom0 once created
 
 HVM domains with stubdomain device models also need a type for the stub domain.
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] XSM-docs: Flask operates on domain types and not on

2016-07-19 Thread Anshul Makkar
Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
 docs/misc/xsm-flask.txt | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/misc/xsm-flask.txt b/docs/misc/xsm-flask.txt
index 62f15dd..bf8bb6e 100644
--- a/docs/misc/xsm-flask.txt
+++ b/docs/misc/xsm-flask.txt
@@ -9,8 +9,8 @@ controls over Xen domains, allowing the policy writer to define 
what
 interactions between domains, devices, and the hypervisor are permitted.
 
 Some examples of what FLASK can do:
- - Prevent two domains from communicating via event channels or grants
- - Control which domains can use device passthrough (and which devices)
+ - Prevent two domains types from communicating via event channels or grants
+ - Control which type of domains can use device passthrough (and which devices)
  - Restrict or audit operations performed by privileged domains
  - Prevent a privileged domain from arbitrarily mapping pages from other 
domains
 
@@ -160,10 +160,10 @@ the policy can be reloaded using "xl loadpolicy".
 The example policy included with Xen demonstrates most of the features of FLASK
 that can be used without dom0 disaggregation. The main types for domUs are:
 
- - domU_t is a domain that can communicate with any other domU_t
+ - domU_t is a domain type that can communicate with any other domU_t types.
  - isolated_domU_t can only communicate with dom0
  - prot_domU_t is a domain type whose creation can be disabled with a boolean
- - nomigrate_t is a domain that must be created via the nomigrate_t_building
+ - nomigrate_t is a domain type that must be created via the 
nomigrate_t_building
type, and whose memory cannot be read by dom0 once created
 
 HVM domains with stubdomain device models also need a type for the stub domain.
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2 1/1] ratelimit: Implement rate limit for credit2 scheduler Rate limit assures that a vcpu will execute for a minimum amount of time before being put at the back of a queue or bei

2016-07-18 Thread Anshul Makkar
It introduces a minimum amount of latency to enable a VM to batch its work and
it also ensures that system is not spending most of its time in
VMEXIT/VMENTRY because of VM that is waking/sleeping at high rate.

ratelimit can be disabled by setting it to 0.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
Changes in v2:
 * algo for time slice calculation based on ratelimit_us has changed.
 * initial value of prv->ratelimit_us.
 * other changes based on review comments.
---
 xen/common/sched_credit2.c | 124 +
 1 file changed, 93 insertions(+), 31 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 8b95a47..68bcdb8 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -280,6 +280,7 @@ struct csched2_private {
 struct csched2_runqueue_data rqd[NR_CPUS];
 
 unsigned int load_window_shift;
+unsigned ratelimit_us; /* each cpupool can have its onw ratelimit */
 };
 
 /*
@@ -1601,6 +1602,34 @@ csched2_dom_cntl(
 return rc;
 }
 
+static int csched2_sys_cntl(const struct scheduler *ops,
+struct xen_sysctl_scheduler_op *sc)
+{
+int rc = -EINVAL;
+xen_sysctl_credit_schedule_t *params = >u.sched_credit;
+struct csched2_private *prv = CSCHED2_PRIV(ops);
+unsigned long flags;
+
+switch (sc->cmd )
+{
+case XEN_SYSCTL_SCHEDOP_putinfo:
+if ( params->ratelimit_us &&
+( params->ratelimit_us < CSCHED2_MIN_TIMER ||
+  params->ratelimit_us > MICROSECS(CSCHED2_MAX_TIMER) ))
+return rc;
+spin_lock_irqsave(>lock, flags);
+prv->ratelimit_us = params->ratelimit_us;
+spin_unlock_irqrestore(>lock, flags);
+break;
+
+case XEN_SYSCTL_SCHEDOP_getinfo:
+params->ratelimit_us = prv->ratelimit_us;
+rc = 0;
+break;
+}
+return rc;
+}
+
 static void *
 csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
 {
@@ -1670,12 +1699,14 @@ csched2_dom_destroy(const struct scheduler *ops, struct 
domain *dom)
 
 /* How long should we let this vcpu run for? */
 static s_time_t
-csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu 
*snext)
+csched2_runtime(const struct scheduler *ops, int cpu,
+struct csched2_vcpu *snext, s_time_t now)
 {
-s_time_t time; 
+s_time_t time, min_time;
 int rt_credit; /* Proposed runtime measured in credits */
 struct csched2_runqueue_data *rqd = RQD(ops, cpu);
 struct list_head *runq = >runq;
+struct csched2_private *prv = CSCHED2_PRIV(ops);
 
 /*
  * If we're idle, just stay so. Others (or external events)
@@ -1688,9 +1719,20 @@ csched2_runtime(const struct scheduler *ops, int cpu, 
struct csched2_vcpu *snext
  * 1) Run until snext's credit will be 0
  * 2) But if someone is waiting, run until snext's credit is equal
  * to his
- * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER.
+ * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER or
+ * run for ratelimit time.
  */
 
+/* Calculate mintime */
+min_time = CSCHED2_MIN_TIMER;
+if ( prv->ratelimit_us ) {
+s_time_t ratelimit_min = prv->ratelimit_us;
+ratelimit_min = snext->vcpu->runstate.state_entry_time +
+MICROSECS(prv->ratelimit_us) - now;
+if ( ratelimit_min > min_time )
+min_time = ratelimit_min;
+}
+
 /* 1) Basic time: Run until credit is 0. */
 rt_credit = snext->credit;
 
@@ -1707,32 +1749,33 @@ csched2_runtime(const struct scheduler *ops, int cpu, 
struct csched2_vcpu *snext
 }
 }
 
-/* The next guy may actually have a higher credit, if we've tried to
- * avoid migrating him from a different cpu.  DTRT.  */
-if ( rt_credit <= 0 )
+/*
+ * The next guy ont the runqueue may actually have a higher credit,
+ * if we've tried to avoid migrating him from a different cpu.
+ * Setting time=0 will ensure the minimum timeslice is chosen.
+ * FIXME: See if we can eliminate this conversion if we know time
+ * will be outside (MIN,MAX).  Probably requires pre-calculating
+ * credit values of MIN,MAX per vcpu, since each vcpu burns credit
+ * at a different rate.
+ */
+if (rt_credit > 0)
+time = c2t(rqd, rt_credit, snext);
+else
+time = 0;
+
+/*
+ * Never run longer than MAX_TIMER or less than MIN_TIMER or for
+ * rate_limit time.
+ */
+if ( time < min_time)
 {
-time = CSCHED2_MIN_TIMER;
-SCHED_STAT_CRANK(runtime_min_timer);
+   time = min_time;
+   SCHED_STAT_CRANK(runtime_min_timer);
 }
-else
+else if (time > CSCHED2_MAX_TIMER)
 {
-/* FIXME: See if we can eliminate this conversi

Re: [Xen-devel] [PATCH] credi2-ratelimit: Implement rate limit for credit2 scheduler

2016-07-18 Thread anshul makkar

On 13/07/16 09:53, Dario Faggioli wrote:

On Tue, 2016-07-12 at 17:16 +0100, George Dunlap wrote:

On Wed, Jul 6, 2016 at 6:33 PM, Makkar anshul.mak...@citrix.com
 wrote:


--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
+#define MAX_TSLICE(t1, t2)  \
+   ({ typeof (t1) _t1 = (t1); \
+  typeof (t1) _t2 = (t2); \
+  _t1 > _t2 ? _t1 < 0 ? 0 : _t1 : _t2 < 0 ? 0
: _t2; })

Hiding the zero-checking behind this seems a bit strange to me.  I
think if the algorithm is properly laid out, we probably don't need
this at all (see my proposed alternate below).


I think it's more "overflow-avoidance" than "zero-checking", and as I
said, that's probably better done by means of subtraction(s).

In any case, I agree that, if we don't need it, that's even better. :-)

Agreed. I am going to remove it. Its not needed with modified algo.





This is all really confusing.  What about something like this (also
attached)?  The basic idea is to calculate min_time, then go through
the normal algorithm, then clip it once at the end.


Yes, this matches my idea and my comment, and I think the code provided
by George makes sense. Only one thing:


@@ -1675,9 +1711,19 @@ csched2_runtime(const struct scheduler *ops,
int cpu, struct csched2_vcpu *snext
+ * or your the ratelimit time.
   */
+/* Calculate mintime */
+min_time = CSCHED2_MIN_TIMER;
+if ( prv->ratelimit_us ) {
+s_time_t ratelimit_min = snext->vcpu-

runstate.state_entry_time +

+MICROSECS(prv->ratelimit_us) - now;


Here snext can indeed be someone which was running already (e.g., we're
choosing current again), in which case runstate.state_entry-time-now
would indeed tell us how long it's actually been running, and the
formula (coupled with the if below) is correct.

But it also can be someone which is runnable (e.g., we're choosing
someone from the runqueue and preempting current), in which case
runstate.state_entry_time tells when it became runnable, and
state_entry_time-now is how long it's been runnable, which is not what
we want here.

In think, in such a case, we want ratelimit_min to just be equal to
prv->ratelimit_us. So, maybe, something like this:

  /* Caluclate mintime */
  min_time = CSCHED2_MIN_TIMER;
  if ( prv->ratelimit_us )
  {
  s_time_t ratelimit_min = prv->ratelimit_us;
  if ( snext->vcpu->is_running ) // XXX or is it better snext == 
curr_on_cpu(cpu)
  ratelimit_min = snext->vcpu->runstate.state_entry_time +
  MICROSECS(prv->ratelimit_us) - now;
  if ( ratelimit_min > min_time )
  min_time = ratelimit_min;
  }


Agreed.

@@ -2353,6 +2431,8 @@ csched2_init(struct scheduler *ops)
  prv->runq_map[i] = -1;
  prv->rqd[i].id = -1;
  }
+/* initialize ratelimit */
+prv->ratelimit_us = 1000 * CSCHED2_MIN_TIMER;

Is there any reason this isn't using sched_ratelimit_us (the
hypervisor command-line option, which the credit scheduler is using)?


Yeah, I guess just using that is the best thing to start with.

Agreed.


Regards,
Dario


Will post the reworked patch.

Anshul


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2] XSM-Policy: allow source domain access to setpodtarget and getpodtarget for ballooning.

2016-07-14 Thread Anshul Makkar
Access to setpodtarget and getpodtarget is required by dom0 to set the balloon
targets for domU. The patch gives source domain (dom0) access to set
this target for domU and resolve the following permission denied erro
message during ballooning :
avc:  denied  { setpodtarget } for domid=0 target=9
scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=domain

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
Acked-by: Daniel De Graaf <dgde...@tycho.nsa.gov>
---
Changed Since V1:
 * added getpodtarget.

 tools/flask/policy/modules/xen.if | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/flask/policy/modules/xen.if 
b/tools/flask/policy/modules/xen.if
index 8c43c28..dbefa1e 100644
--- a/tools/flask/policy/modules/xen.if
+++ b/tools/flask/policy/modules/xen.if
@@ -83,7 +83,8 @@ define(`create_domain_build_label', `
 define(`manage_domain', `
allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity
getaddrsize pause unpause trigger shutdown destroy
-   setaffinity setdomainmaxmem getscheduler resume };
+   setaffinity setdomainmaxmem getscheduler resume
+   setpodtarget getpodtarget };
 allow $1 $2:domain2 set_vnumainfo;
 ')
 
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] XSM-Policy: allow source domain access to setpodtarget for ballooning.

2016-07-13 Thread Anshul Makkar
Access to setpodtarget is required by dom0 to set the balloon targets for
domU. The patch gives source domain (dom0) access to set this target for
domU and resolve the following permission denied error message during
ballooning :
avc:  denied  { setpodtarget } for domid=0 target=9
scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=domain

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
---
 tools/flask/policy/modules/xen.if | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/flask/policy/modules/xen.if 
b/tools/flask/policy/modules/xen.if
index 8c43c28..8ae3c2e 100644
--- a/tools/flask/policy/modules/xen.if
+++ b/tools/flask/policy/modules/xen.if
@@ -83,7 +83,8 @@ define(`create_domain_build_label', `
 define(`manage_domain', `
allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity
getaddrsize pause unpause trigger shutdown destroy
-   setaffinity setdomainmaxmem getscheduler resume };
+   setaffinity setdomainmaxmem getscheduler resume
+   setpodtarget };
 allow $1 $2:domain2 set_vnumainfo;
 ')
 
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] default XSM policy for PCI passthrough for unlabeled resources.

2016-07-07 Thread anshul makkar

On 07/07/16 16:36, Daniel De Graaf wrote:

On 07/06/2016 12:19 PM, anshul makkar wrote:

On 06/07/16 16:59, Daniel De Graaf wrote:

On 07/06/2016 11:34 AM, anshul makkar wrote:

Hi,


It allows the resource to be added and removed by the source domain to
target domain, but its use by target domain is blocked.


This rule only mandates the use of resource_type for resource types.  If
you are creating a new resource type, follow the example in nic_dev.te.

Agreed, but inherently it means that "use" of any unlabeled resource
be it irq, ioport or iomem or nic_dev is restricted.


Restricted how?  The fallback types have the resource_type attribute.

Restricted if they are unlabeled.


Neverallow rules are actually not present in the binary policy; they act as
compile-time assertions in the policy build.


Fine.



The resource can be used only if it has been labeled using
flask-label-pci command which needs to be rerun after every boot and
after every policy reload.



Try adding a module with the following rules, which should allow domU to
use unlabeled devices:

use_device(domU_t, irq_t)
use_device(domU_t, ioport_t)
use_device(domU_t, iomem_t)
use_device(domU_t, device_t)

Yes, it does work , but I have added these in delegate_device to make
it restrict to the case where there is delegation.


This prevents using delegate_devices without allowing access to unlabeled
devices.  If you think this should be a macro, I would suggest making a new
one named something like "delegate_unlabeled_devices".



Agreed. That's a better approach.
I believe this macro can make the default policy more flexible and 
useful for more general audience, so it should be there in the policy. I 
can submit patch for the same. Your thoughts ?


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] XSM/policy: Allow the source domain access to settime and setdomainhandle domctls while creating domain.

2016-07-07 Thread Anshul Makkar anshul.makkar
From: Anshul Makkar <anshul.mak...@citrix.com>

This patch resolves the following permission denied scenarios while creating
new domU :
avc:  denied  { setdomainhandle } for domid=0 target=1
scontext=system_u:system_r:dom0_t tcontext=system_u:system_r:domU_t 
tclass=domain

avc:  denied  { settime } for domid=0 target=1 scontext=system_u:system_r:dom0_t
tcontext=system_u:system_r:domU_t tclass=domain

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
 tools/flask/policy/modules/xen.if | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/flask/policy/modules/xen.if 
b/tools/flask/policy/modules/xen.if
index fd96303..8c43c28 100644
--- a/tools/flask/policy/modules/xen.if
+++ b/tools/flask/policy/modules/xen.if
@@ -48,7 +48,8 @@ define(`declare_build_label', `
 define(`create_domain_common', `
allow $1 $2:domain { create max_vcpus setdomainmaxmem setaddrsize
getdomaininfo hypercall setvcpucontext getscheduler
-   getvcpuinfo getaddrsize getaffinity setaffinity };
+   getvcpuinfo getaddrsize getaffinity setaffinity
+   settime setdomainhandle };
allow $1 $2:domain2 { set_cpuid settsc setscheduler setclaim
set_max_evtchn set_vnumainfo get_vnumainfo cacheflush
psr_cmt_op psr_cat_op soft_reset };
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] credi2-ratelimit: Implement rate limit for credit2 scheduler

2016-07-06 Thread Anshul Makkar anshul.makkar
From: Anshul Makkar <anshul.mak...@citrix.com>

Rate limit assures that a vcpu will execute for a minimum amount of time before
being put at the back of a queue or being preempted by higher priority thread.

It introduces a minimum amount of latency to enable a VM to batch its work and
it also ensures that system is not spending most of its time in
VMEXIT/VMENTRY because of VM that is waking/sleeping at high rate.

ratelimit can be disabled by setting it to 0.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
---
 xen/common/sched_credit2.c | 115 ++---
 1 file changed, 98 insertions(+), 17 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 1933ff1..6718574 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -171,6 +171,11 @@ integer_param("sched_credit2_migrate_resist", 
opt_migrate_resist);
 #define c2r(_ops, _cpu) (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
 /* CPU to runqueue struct macro */
 #define RQD(_ops, _cpu) (_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
+/* Find the max of time slice */
+#define MAX_TSLICE(t1, t2)  \
+   ({ typeof (t1) _t1 = (t1); \
+  typeof (t1) _t2 = (t2); \
+  _t1 > _t2 ? _t1 < 0 ? 0 : _t1 : _t2 < 0 ? 0 : _t2; })
 
 /*
  * Shifts for load average.
@@ -280,6 +285,7 @@ struct csched2_private {
 struct csched2_runqueue_data rqd[NR_CPUS];
 
 unsigned int load_window_shift;
+unsigned ratelimit_us; /* each cpupool can have its onw ratelimit */
 };
 
 /*
@@ -1588,6 +1594,34 @@ csched2_dom_cntl(
 return rc;
 }
 
+static int csched2_sys_cntl(const struct scheduler *ops,
+struct xen_sysctl_scheduler_op *sc)
+{
+int rc = -EINVAL;
+xen_sysctl_credit_schedule_t *params = >u.sched_credit;
+struct csched2_private *prv = CSCHED2_PRIV(ops);
+unsigned long flags;
+
+switch (sc->cmd )
+{
+case XEN_SYSCTL_SCHEDOP_putinfo:
+if ( params->ratelimit_us &&
+( params->ratelimit_us < CSCHED2_MIN_TIMER ||
+  params->ratelimit_us > MICROSECS(CSCHED2_MAX_TIMER) ))
+return rc;
+spin_lock_irqsave(>lock, flags);
+prv->ratelimit_us = params->ratelimit_us;
+spin_unlock_irqrestore(>lock, flags);
+break;
+
+case XEN_SYSCTL_SCHEDOP_getinfo:
+params->ratelimit_us = prv->ratelimit_us;
+rc = 0;
+break;
+}
+return rc;
+}
+
 static void *
 csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
 {
@@ -1657,12 +1691,15 @@ csched2_dom_destroy(const struct scheduler *ops, struct 
domain *dom)
 
 /* How long should we let this vcpu run for? */
 static s_time_t
-csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu 
*snext)
+csched2_runtime(const struct scheduler *ops, int cpu,
+struct csched2_vcpu *snext, s_time_t now)
 {
-s_time_t time; 
+s_time_t time;
 int rt_credit; /* Proposed runtime measured in credits */
 struct csched2_runqueue_data *rqd = RQD(ops, cpu);
 struct list_head *runq = >runq;
+s_time_t runtime = 0;
+struct csched2_private *prv = CSCHED2_PRIV(ops);
 
 /*
  * If we're idle, just stay so. Others (or external events)
@@ -1680,6 +1717,14 @@ csched2_runtime(const struct scheduler *ops, int cpu, 
struct csched2_vcpu *snext
 
 /* 1) Basic time: Run until credit is 0. */
 rt_credit = snext->credit;
+if (snext->vcpu->is_running)
+runtime = now - snext->vcpu->runstate.state_entry_time;
+if ( runtime < 0 )
+{
+runtime = 0;
+d2printk("%s: Time went backwards? now %"PRI_stime" state_entry_time 
%"PRI_stime"\n",
+  _func__, now, snext->runstate.state_entry_time);
+}
 
 /* 2) If there's someone waiting whose credit is positive,
  * run until your credit ~= his */
@@ -1695,11 +1740,24 @@ csched2_runtime(const struct scheduler *ops, int cpu, 
struct csched2_vcpu *snext
 }
 
 /* The next guy may actually have a higher credit, if we've tried to
- * avoid migrating him from a different cpu.  DTRT.  */
+ * avoid migrating him from a different cpu.  DTRT.
+ * Even if the next guy has higher credit and current vcpu has executed
+ * for less amount of time than rate limit, allow it to run for minimum
+ * amount of time.
+ */
 if ( rt_credit <= 0 )
 {
-time = CSCHED2_MIN_TIMER;
-SCHED_STAT_CRANK(runtime_min_timer);
+if ( snext->vcpu->is_running && prv->ratelimit_us)
+   /* implies the current one has executed for time < ratelimit and 
thus
+* it has neen selcted int runq_candidate to run next.
+* No need to check f

Re: [Xen-devel] default XSM policy for PCI passthrough for unlabeled resources.

2016-07-06 Thread anshul makkar

On 06/07/16 16:59, Daniel De Graaf wrote:

On 07/06/2016 11:34 AM, anshul makkar wrote:

Hi,


It allows the resource to be added and removed by the source domain to
target domain, but its use by target domain is blocked.


This rule only mandates the use of resource_type for resource types.  If
you are creating a new resource type, follow the example in nic_dev.te.
Agreed, but inherently it means that "use" of any unlabeled resource be 
it irq, ioport or iomem or nic_dev is restricted.



The resource can be used only if it has been labeled using
flask-label-pci command which needs to be rerun after every boot and
after every policy reload.


Yes; this gives the most control over what resources can be delegated.
Policy reloads are supposed to be rare (on a production system) and you
already need special boot scripts (or parameters) to set up the device
for passthrough, so this can be added there.  However, I agree this can
be more work than a "default" FLASK policy should require.

Try adding a module with the following rules, which should allow domU to
use unlabeled devices:

use_device(domU_t, irq_t)
use_device(domU_t, ioport_t)
use_device(domU_t, iomem_t)
use_device(domU_t, device_t)
Yes, it does work , but I have added these in delegate_device to make it 
restrict to the case where there is delegation.


If this works, that module could be added to the default policy.


Given that what we ship is just a sample default policy for reference
which is expected to be permissive in most of the scenarios so that it
doesn't affect the basic functionalities, is this "neverallow" rule
needed ?

Thanks
Anshul Makkar


The neverallow rules are just there to ensure that the attributes are
being used correctly.


Anshul

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] default XSM policy for PCI passthrough for unlabeled resources.

2016-07-06 Thread anshul makkar

Hi,

Default XSM policy doesn't allow the use of unlabeled PCI resources that 
have been passed through to target domain.


xen.te
# Resources must be declared using . resource_type 


neverallow * ~resource_type:resource use;

It allows the resource to be added and removed by the source domain to 
target domain, but its use by target domain is blocked.


The resource can be used only if it has been labeled using 
flask-label-pci command which needs to be rerun after every boot and 
after every policy reload.


The above approach reduces the flexibility and necessitates admin 
intervention to give passthrough rights after host has booted. Also, in 
general if I want to allow all domUs to have PCI passthough access to 
all PCI device, I have no other way apart from disabling the 
"neverallow" rule.


Given that what we ship is just a sample default policy for reference 
which is expected to be permissive in most of the scenarios so that it 
doesn't affect the basic functionalities, is this "neverallow" rule needed ?


Thanks
Anshul Makkar









___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement

2016-06-27 Thread anshul makkar

On 18/06/16 00:13, Dario Faggioli wrote:

In fact, right now, we recommend keepeing runqueues
arranged per-core, so that it is the inter-runqueue load
balancing code that automatically spreads the work in an
SMT friendly way. This means that any other runq
arrangement one may want to use falls short of SMT
scheduling optimizations.

This commit implements SMT awareness --similar to the
one we have in Credit1-- for any possible runq
arrangement. This turned out to be pretty easy to do,
as the logic can live entirely in runq_tickle()
(although, in order to avoid for_each_cpu loops in
that function, we use a new cpumask which indeed needs
to be updated in other places).

In addition to disentangling SMT awareness from load
balancing, this also allows us to support the
sched_smt_power_savings parametar in Credit2 as well.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>


Reviewed-by: Anshul Makkar <anshul.mak...@citrix.com>

---
Cc: George Dunlap <george.dun...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
Cc: David Vrabel <david.vra...@citrix.com>
---
  xen/common/sched_credit2.c |  141 +++-
  1 file changed, 126 insertions(+), 15 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 93943fa..a8b3a85 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -351,7 +351,8 @@ struct csched2_runqueue_data {
  unsigned int max_weight;

  cpumask_t idle,/* Currently idle */
-tickled;   /* Another cpu in the queue is already targeted for 
this one */
+smt_idle,  /* Fully idle cores (as in all the siblings are 
idle) */
+tickled;   /* Have been asked to go through schedule */
  int load;  /* Instantaneous load: Length of queue  + num 
non-idle threads */
  s_time_t load_last_update;  /* Last time average was updated */
  s_time_t avgload;   /* Decaying queue load */
@@ -412,6 +413,73 @@ struct csched2_dom {
  };

  /*
+ * Hyperthreading (SMT) support.
+ *
+ * We use a special per-runq mask (smt_idle) and update it according to the
+ * following logic:
+ *  - when _all_ the SMT sibling in a core are idle, all their corresponding
+ *bits are set in the smt_idle mask;
+ *  - when even _just_one_ of the SMT siblings in a core is not idle, all the
+ *bits correspondings to it and to all its siblings are clear in the
+ *smt_idle mask.
+ *
+ * Once we have such a mask, it is easy to implement a policy that, either:
+ *  - uses fully idle cores first: it is enough to try to schedule the vcpus
+ *on pcpus from smt_idle mask first. This is what happens if
+ *sched_smt_power_savings was not set at boot (default), and it maximizes
+ *true parallelism, and hence performance;
+ *  - uses already busy cores first: it is enough to try to schedule the vcpus
+ *on pcpus that are idle, but are not in smt_idle. This is what happens if
+ *sched_smt_power_savings is set at boot, and it allows as more cores as
+ *possible to stay in low power states, minimizing power consumption.
+ *
+ * This logic is entirely implemented in runq_tickle(), and that is enough.
+ * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a
+ * runq, _always_ happens by means of tickling:
+ *  - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls
+ *runq_tickle();
+ *  - when a migration is initiated in schedule.c, we call csched2_cpu_pick(),
+ *csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake().
+ *csched2_cpu_pick() looks for the least loaded runq and return just any
+ *of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to
+ *the chosen runq, and it is again runq_tickle(), called by
+ *csched2_vcpu_wake() that actually decides what pcpu to use within the
+ *chosen runq;
+ *  - when a migration is initiated in sched_credit2.c, by calling  migrate()
+ *directly, that again temporarily use a random pcpu from the new runq,
+ *and then calls runq_tickle(), by itself.
+ */
+
+/*
+ * If all the siblings of cpu (including cpu itself) are in idlers,
+ * set all their bits in mask.
+ *
+ * In order to properly take into account tickling, idlers needs to be
+ * set qeual to something like:
+ *
+ *  rqd->idle & (~rqd->tickled)
+ *
+ * This is because cpus that have been tickled will very likely pick up some
+ * work as soon as the manage to schedule, and hence we should really consider
+ * them as busy.
+ */
+static inline
+void smt_idle_mask_set(unsigned int cpu, cpumask_t *idlers, cpumask_t *mask)
+{
+if ( cpumask_subset( per_cpu(cpu_sibling_mask, cpu), idlers) )
+cpumask_or(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+}
+
+/*
+ * Clear the bits of all the siblings of cpu from mask.
+ */
+static inline
+void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
+{

Re: [Xen-devel] [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter

2016-06-21 Thread anshul makkar

On 18/06/16 00:11, Dario Faggioli wrote:

In fact, they always operate on the svc->processor of
the csched2_vcpu passed to them.

No functional change intended.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>


Reviewed-by: Anshul Makkar <anshul.mak...@citrix.com>

---
Cc: George Dunlap <george.dnu...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
Cc: David Vrabel <david.vra...@citrix.com>
---
  xen/common/sched_credit2.c |   19 ++-
  1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 0246453..5881583 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -518,8 +518,9 @@ __runq_insert(struct list_head *runq, struct csched2_vcpu 
*svc)
  }

  static void
-runq_insert(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu 
*svc)
+runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
  {
+unsigned int cpu = svc->vcpu->processor;
  struct list_head * runq = (ops, cpu)->runq;
  int pos = 0;

@@ -558,17 +559,17 @@ void burn_credits(struct csched2_runqueue_data *rqd, 
struct csched2_vcpu *, s_ti
  /* Check to see if the item on the runqueue is higher priority than what's
   * currently running; if so, wake up the processor */
  static /*inline*/ void
-runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu 
*new, s_time_t now)
+runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t 
now)
  {
  int i, ipid=-1;
  s_time_t lowest=(1<<30);
+unsigned int cpu = new->vcpu->processor;
  struct csched2_runqueue_data *rqd = RQD(ops, cpu);
  cpumask_t mask;
  struct csched2_vcpu * cur;

  d2printk("rqt %pv curr %pv\n", new->vcpu, current);

-BUG_ON(new->vcpu->processor != cpu);
  BUG_ON(new->rqd != rqd);

  /* Look at the cpu it's running on first */
@@ -1071,8 +1072,8 @@ csched2_vcpu_wake(const struct scheduler *ops, struct 
vcpu *vc)
  update_load(ops, svc->rqd, svc, 1, now);

  /* Put the VCPU on the runq */
-runq_insert(ops, vc->processor, svc);
-runq_tickle(ops, vc->processor, svc, now);
+runq_insert(ops, svc);
+runq_tickle(ops, svc, now);

  out:
  d2printk("w-\n");
@@ -1104,8 +1105,8 @@ csched2_context_saved(const struct scheduler *ops, struct 
vcpu *vc)
  {
  BUG_ON(__vcpu_on_runq(svc));

-runq_insert(ops, vc->processor, svc);
-runq_tickle(ops, vc->processor, svc, now);
+runq_insert(ops, svc);
+runq_tickle(ops, svc, now);
  }
  else if ( !is_idle_vcpu(vc) )
  update_load(ops, svc->rqd, svc, -1, now);
@@ -1313,8 +1314,8 @@ static void migrate(const struct scheduler *ops,
  if ( on_runq )
  {
  update_load(ops, svc->rqd, NULL, 1, now);
-runq_insert(ops, svc->vcpu->processor, svc);
-runq_tickle(ops, svc->vcpu->processor, svc, now);
+runq_insert(ops, svc);
+runq_tickle(ops, svc, now);
  SCHED_STAT_CRANK(migrate_on_runq);
  }
  else




___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone.

2016-06-21 Thread anshul makkar

On 18/06/16 00:11, Dario Faggioli wrote:

In both Credit1 and Credit2, stop considering a pCPU idle,
if the reason why the idle vCPU is being selected, is to
do tasklet work.

Not doing so means that the tickling and load balancing
logic, seeing the pCPU as idle, considers it a candidate
for picking up vCPUs. But the pCPU won't actually pick
up or schedule any vCPU, which would then remain in the
runqueue, which is bas, especially if there were other,
truly idle pCPUs, that could execute it.

The only drawback is that we can't assume that a pCPU is
in always marked as idle when being removed from an
instance of the Credit2 scheduler (csched2_deinit_pdata).
In fact, if we are in stop-machine (i.e., during suspend
or shutdown), the pCPUs are running the stopmachine_tasklet
and hence are actually marked as busy. On the other hand,
when removing a pCPU from a Credit2 pool, it will indeed
be idle. The only thing we can do, therefore, is to
remove the BUG_ON() check.

Signed-off-by: Dario Faggioli <dario.faggi...@citrix.com>


Reviewed-by: Anshul Makkar <anshul.mak...@citrix.com>

---
Cc: George Dunlap <george.dun...@citrix.com>
Cc: Anshul Makkar <anshul.mak...@citrix.com>
Cc: David Vrabel <david.vra...@citrix.com>
---
  xen/common/sched_credit.c  |   12 ++--
  xen/common/sched_credit2.c |   14 ++
  2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index a38a63d..a6645a2 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1819,24 +1819,24 @@ csched_schedule(
  else
  snext = csched_load_balance(prv, cpu, snext, );

+ out:
  /*
   * Update idlers mask if necessary. When we're idling, other CPUs
   * will tickle us when they get extra work.
   */
-if ( snext->pri == CSCHED_PRI_IDLE )
+if ( tasklet_work_scheduled || snext->pri != CSCHED_PRI_IDLE )
  {
-if ( !cpumask_test_cpu(cpu, prv->idlers) )
-cpumask_set_cpu(cpu, prv->idlers);
+if ( cpumask_test_cpu(cpu, prv->idlers) )
+cpumask_clear_cpu(cpu, prv->idlers);
  }
-else if ( cpumask_test_cpu(cpu, prv->idlers) )
+else if ( !cpumask_test_cpu(cpu, prv->idlers) )
  {
-cpumask_clear_cpu(cpu, prv->idlers);
+cpumask_set_cpu(cpu, prv->idlers);
  }

  if ( !is_idle_vcpu(snext->vcpu) )
  snext->start_time += now;

-out:
  /*
   * Return task to run next...
   */
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 1933ff1..cf8455c 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1910,8 +1910,16 @@ csched2_schedule(
  }
  else
  {
-/* Update the idle mask if necessary */
-if ( !cpumask_test_cpu(cpu, >idle) )
+/*
+ * Update the idle mask if necessary. Note that, if we're scheduling
+ * idle in order to carry on some tasklet work, we want to play busy!
+ */
+if ( tasklet_work_scheduled )
+{
+if ( cpumask_test_cpu(cpu, >idle) )
+cpumask_clear_cpu(cpu, >idle);
+}
+else if ( !cpumask_test_cpu(cpu, >idle) )
  cpumask_set_cpu(cpu, >idle);
  /* Make sure avgload gets updated periodically even
   * if there's no activity */
@@ -2291,8 +2299,6 @@ csched2_deinit_pdata(const struct scheduler *ops, void 
*pcpu, int cpu)
  /* No need to save IRQs here, they're already disabled */
  spin_lock(>lock);

-BUG_ON(!cpumask_test_cpu(cpu, >idle));
-
  printk("Removing cpu %d from runqueue %d\n", cpu, rqi);

  cpumask_clear_cpu(cpu, >idle);




___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 03/34] xsm/xen_version: Add XSM for the xen_version hypercall

2016-03-24 Thread anshul makkar

On 15/03/16 17:56, Konrad Rzeszutek Wilk wrote:

All of XENVER_* have now an XSM check for their sub-ops.

The subop for XENVER_commandline is now a priviliged operation.
To not break guests we still return an string - but it is
just '\0'.

The rest: XENVER_[version|extraversion|capabilities|
parameters|get_features|page_size|guest_handle|changeset|
compile_info] behave as before - allowed by default for all
guests if using the XSM default policy or with the dummy one.

The admin can choose to change the sub-ops to be denied
as they see fit.

Also we add a local variable block.

Signed-off-by: Konrad Rzeszutek Wilk 

---
Cc: Daniel De Graaf 
Cc: Ian Jackson 
Cc: Stefano Stabellini 
Cc: Wei Liu 

v2: Do XSM check for all the XENVER_ ops.
v3: Add empty data conditions.
v4: Return  for priv subops.
v5: Move extraversion from priv to normal. Drop the XSM check
 for the non-priv subops.
v6: Add +1 for strlen(xen_deny()) to include NULL. Move changeset,
 compile_info to non-priv subops.
v7: Remove the \0 on xen_deny()
v8: Add new XSM domain for xenver hypercall. Add all subops to it.
v9: Remove the extra line, Add Ack from Daniel
v10: Rename the XSM from xen_version_op to xsm_xen_version.
 Prefix the types with 'xen' to distinguish it from another
 hypercall performing similar operation. Removed Ack from Daniel
 as it was so large. Add local variable block.
---
  tools/flask/policy/policy/modules/xen/xen.te | 15 
  xen/common/kernel.c  | 53 +---
  xen/common/version.c | 15 
  xen/include/xen/version.h|  2 +-
  xen/include/xsm/dummy.h  | 22 
  xen/include/xsm/xsm.h|  5 +++
  xen/xsm/dummy.c  |  1 +
  xen/xsm/flask/hooks.c| 43 ++
  xen/xsm/flask/policy/access_vectors  | 28 +++
  xen/xsm/flask/policy/security_classes|  1 +
  10 files changed, 172 insertions(+), 13 deletions(-)

diff --git a/tools/flask/policy/policy/modules/xen/xen.te 
b/tools/flask/policy/policy/modules/xen/xen.te
index d35ae22..7e7400d 100644
--- a/tools/flask/policy/policy/modules/xen/xen.te
+++ b/tools/flask/policy/policy/modules/xen/xen.te
@@ -73,6 +73,15 @@ allow dom0_t xen_t:xen2 {
  pmu_ctrl
  get_symbol
  };
+
+# Allow dom0 to use all XENVER_ subops
+# Note that dom0 is part of domain_type so this has duplicates.
+allow dom0_t xen_t:version {
+xen_version xen_extraversion xen_compile_info xen_capabilities
+xen_changeset xen_platform_parameters xen_get_features xen_pagesize
+xen_guest_handle xen_commandline
+};
+
  allow dom0_t xen_t:mmu memorymap;

  # Allow dom0 to use these domctls on itself. For domctls acting on other
@@ -137,6 +146,12 @@ if (guest_writeconsole) {
  # pmu_ctrl is for)
  allow domain_type xen_t:xen2 pmu_use;

+# For normal guests all except XENVER_commandline
+allow domain_type xen_t:version {
+xen_version xen_extraversion xen_compile_info xen_capabilities
+xen_changeset xen_platform_parameters xen_get_features xen_pagesize
+xen_guest_handle
+};
  
###
  #
  # Domain creation
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 0618da2..2699ac0 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -13,6 +13,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -223,12 +224,15 @@ void __init do_initcalls(void)
  /*
   * Simple hypercalls.
   */
-
  DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
  {
+bool_t deny = !!xsm_xen_version(XSM_OTHER, cmd);
+
  switch ( cmd )
  {
  case XENVER_version:
+if ( deny )
+return 0;
  return (xen_major_version() << 16) | xen_minor_version();

  case XENVER_extraversion:
@@ -236,7 +240,7 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
  xen_extraversion_t extraversion;

  memset(extraversion, 0, sizeof(extraversion));
-safe_strcpy(extraversion, xen_extra_version());
+safe_strcpy(extraversion, deny ? xen_deny() : xen_extra_version());
  if ( copy_to_guest(arg, extraversion, ARRAY_SIZE(extraversion)) )
  return -EFAULT;
  return 0;
@@ -247,10 +251,10 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
  xen_compile_info_t info;

  memset(, 0, sizeof(info));
-safe_strcpy(info.compiler,   xen_compiler());
-safe_strcpy(info.compile_by, xen_compile_by());
-safe_strcpy(info.compile_domain, xen_compile_domain());
-safe_strcpy(info.compile_date,   xen_compile_date());
+safe_strcpy(info.compiler,   deny ? xen_deny() : 

Re: [Xen-devel] [PATCH] flask: change default state to enforcing

2016-03-15 Thread Anshul Makkar


-Original Message-
From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of Konrad 
Rzeszutek Wilk
Sent: 10 March 2016 19:12
To: Daniel De Graaf ; Ian Jackson 
; jbeul...@suse.com
Cc: xen-de...@lists.xenproject.org; car...@cardoe.com; Andrew Cooper 

Subject: Re: [Xen-devel] [PATCH] flask: change default state to enforcing

On Thu, Mar 10, 2016 at 01:30:29PM -0500, Daniel De Graaf wrote:

I've added Ian and Jan on the email as scripts/get_maintainer.pl spits out 
their names (Oddly not yours?)
> The previous default of "permissive" is meant for developing or 
> debugging a disaggregated system.  However, this default makes it too 
> easy to accidentally boot a machine in this state, which does not 
> place any restrictions on guests.  This is not suitable for normal 
> systems because any guest can perform any operation (including 
> operations like rebooting the machine, kexec, and reading or writing 
> another domain's memory).
> 
> This change will cause the boot to fail if you do not specify an XSM 
> policy during boot; if you need to load a policy from dom0, use the 
> "flask=late" boot parameter.
> 
> Originally by Konrad Rzeszutek Wilk ; modified 
> to also change the default value of flask_enforcing so that the policy 
> is not still in permissive mode.  This also removes the (no longer
> documented) command line argument directly changing that variable 
> since it has been superseded by the flask= parameter.
> 

Reviwed-by: Konrad Rzeszutek Wilk  .. however:

> Signed-off-by: Daniel De Graaf 
> ---
> 
>  docs/misc/xen-command-line.markdown |  2 +-
>  docs/misc/xsm-flask.txt | 12 ++--
>  xen/xsm/flask/flask_op.c|  8 +---
>  3 files changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/docs/misc/xen-command-line.markdown 
> b/docs/misc/xen-command-line.markdown
> index ca77e3b..9e77f8a 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -662,7 +662,7 @@ to use the default.
>  ### flask
>  > `= permissive | enforcing | late | disabled`
>  
> -> Default: `permissive`
> +> Default: `enforcing`
>  
>  Specify how the FLASK security server should be configured.  This 
> option is only  available if the hypervisor was compiled with XSM 
> support (which can be enabled diff --git a/docs/misc/xsm-flask.txt 
> b/docs/misc/xsm-flask.txt index fb2fe9f..00a2b13 100644
> --- a/docs/misc/xsm-flask.txt
> +++ b/docs/misc/xsm-flask.txt
> @@ -283,12 +283,12 @@ for passthrough, run:
>  
>  This command must be rerun on each boot or after any policy reload.
>  
> -The example policy was only tested with simple domain creation and 
> may be -missing rules allowing accesses by dom0 or domU when a number 
> of hypervisor -features are used. When first loading or writing a 
> policy, you should run FLASK -in permissive mode (the default) and 
> check the Xen logs (xl dmesg) for AVC -denials before using it in 
> enforcing mode (flask_enforcing=1 on the command -line, or xl setenforce).
> +When first loading or writing a policy, you should run FLASK in 
> +permissive mode (flask=permissive on the command line) and check the 
> +Xen logs (xl dmesg) for AVC denials before using it in enforcing mode 
> +(the default value of the boot parameter, which can also be changed 
> +using xl setenforce).  When using the default types for domains 
> +(domU_t), the example policy shipped with Xen should allow the same 
> operations on or between domains as when not using FLASK.
>  
>  
>  MLS/MCS policy
> diff --git a/xen/xsm/flask/flask_op.c b/xen/xsm/flask/flask_op.c index 
> f4f5dd1..cdb462c 100644
> --- a/xen/xsm/flask/flask_op.c
> +++ b/xen/xsm/flask/flask_op.c
> @@ -25,12 +25,11 @@
>  #define _copy_to_guest copy_to_guest
>  #define _copy_from_guest copy_from_guest
>  
> -enum flask_bootparam_t __read_mostly flask_bootparam = 
> FLASK_BOOTPARAM_PERMISSIVE;
> +enum flask_bootparam_t __read_mostly flask_bootparam = 
> +FLASK_BOOTPARAM_ENFORCING;
>  static void parse_flask_param(char *s);  custom_param("flask", 
> parse_flask_param);
>  
> -bool_t __read_mostly flask_enforcing = 0; 
> -boolean_param("flask_enforcing", flask_enforcing);
> +bool_t __read_mostly flask_enforcing = 1;

Since you set that to the default value should the parse_flask_param 
'flask_enforcing = 1' for the 'enforcing' and 'late' be removed?

(If you agree, the committer could do it).

>  
>  #define MAX_POLICY_SIZE 0x400
>  
> @@ -76,7 +75,10 @@ static void __init parse_flask_param(char *s)
>  else if ( !strcmp(s, "disabled") )
>  flask_bootparam = FLASK_BOOTPARAM_DISABLED;
>  else if ( !strcmp(s, "permissive") )
> +{
> +flask_enforcing = 0;
>  flask_bootparam = FLASK_BOOTPARAM_PERMISSIVE;
> +}
>  else
>  flask_bootparam = FLASK_BOOTPARAM_INVALID;  }
> 

Re: [Xen-devel] [PATCH v2 2/3] xen: add hypercall option to temporarily pin a vcpu

2016-03-02 Thread Anshul Makkar
Hi,


-Original Message-
From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of George 
Dunlap
Sent: 01 March 2016 15:53
To: Juergen Gross <jgr...@suse.com>; xen-devel@lists.xen.org
Cc: Wei Liu <wei.l...@citrix.com>; Stefano Stabellini 
<stefano.stabell...@citrix.com>; George Dunlap <george.dun...@citrix.com>; 
Andrew Cooper <andrew.coop...@citrix.com>; Dario Faggioli 
<dario.faggi...@citrix.com>; Ian Jackson <ian.jack...@citrix.com>; David Vrabel 
<david.vra...@citrix.com>; jbeul...@suse.com
Subject: Re: [Xen-devel] [PATCH v2 2/3] xen: add hypercall option to 
temporarily pin a vcpu

On 01/03/16 09:02, Juergen Gross wrote:
> Some hardware (e.g. Dell studio 1555 laptops) require SMIs to be 
> called on physical cpu 0 only. Linux drivers like dcdbas or i8k try to 
> achieve this by pinning the running thread to cpu 0, but in Dom0 this 
> is not enough: the vcpu must be pinned to physical cpu 0 via Xen, too.
> 
> Add a stable hypercall option SCHEDOP_pin_temp to the sched_op 
> hypercall to achieve this. It is taking a physical cpu number as 
> parameter. If pinning is possible (the calling domain has the 
> privilege to make the call and the cpu is available in the domain's
> cpupool) the calling vcpu is pinned to the specified cpu. The old cpu 
> affinity is saved. To undo the temporary pinning a cpu -1 is 
> specified. This will restore the original cpu affinity for the vcpu.
> 
> Signed-off-by: Juergen Gross <jgr...@suse.com>
> ---
> V2: - limit operation to hardware domain as suggested by Jan Beulich
> - some style issues corrected as requested by Jan Beulich
> - use fixed width types in interface as requested by Jan Beulich
> - add compat layer checking as requested by Jan Beulich
> ---
>  xen/common/compat/schedule.c |  4 ++
>  xen/common/schedule.c| 92 
> +---
>  xen/include/public/sched.h   | 17 
>  xen/include/xlat.lst |  1 +
>  4 files changed, 109 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/common/compat/schedule.c 
> b/xen/common/compat/schedule.c index 812c550..73b0f01 100644
> --- a/xen/common/compat/schedule.c
> +++ b/xen/common/compat/schedule.c
> @@ -10,6 +10,10 @@
>  
>  #define do_sched_op compat_sched_op
>  
> +#define xen_sched_pin_temp sched_pin_temp CHECK_sched_pin_temp; 
> +#undef xen_sched_pin_temp
> +
>  #define xen_sched_shutdown sched_shutdown  CHECK_sched_shutdown;  
> #undef xen_sched_shutdown diff --git a/xen/common/schedule.c 
> b/xen/common/schedule.c index b0d4b18..653f852 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -271,6 +271,12 @@ int sched_move_domain(struct domain *d, struct cpupool 
> *c)
>  struct scheduler *old_ops;
>  void *old_domdata;
>  
> +for_each_vcpu ( d, v )
> +{
> +if ( v->affinity_broken )
> +return -EBUSY;
> +}
> +
>  domdata = SCHED_OP(c->sched, alloc_domdata, d);
>  if ( domdata == NULL )
>  return -ENOMEM;
> @@ -669,6 +675,14 @@ int cpu_disable_scheduler(unsigned int cpu)
>  if ( cpumask_empty(_affinity) &&
>   cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
>  {
> +if ( v->affinity_broken )
> +{
> +/* The vcpu is temporarily pinned, can't move it. */
> +vcpu_schedule_unlock_irqrestore(lock, flags, v);
> +ret = -EBUSY;
> +break;
> +}

Does this mean that if the user closes the laptop lid while one of these 
drivers has vcpu0 pinned, that Xen will crash (see 
xen/arch/x86/smpboot.c:__cpu_disable())?  Or is it the OS's job to make sure 
that all temporary pins are removed before suspending?

Also -- have you actually tested the "cpupool move while pinned"
functionality to make sure it actually works?  There's a weird bit in
cpupool_unassign_cpu_helper() where after calling cpu_disable_scheduler(cpu), 
it unconditionally sets the cpu bit in the cpupool_free_cpus mask, even if it 
returns an error.  That can't be right, even for the existing -EAGAIN case, can 
it?

I see that you have a loop to retry this call several times in the next patch; 
but what if it fails every time -- what state is the system in?

And, in general, what happens if the device driver gets mixed up and forgets to 
unpin the vcpu?  Is the only recourse to reboot your host (or deal with the 
fact that you can't reconfigure your cpupools)?

 -George

Sorry, lost the original thread so replying at the top of mail chain.

+static XSM_INLINE int xsm_schedop_pin_temp(XSM_DEFAULT_VOID) 
+{ 
+ XSM_ASSERT_ACTI

Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-01 Thread Anshul Makkar
Based on the discussion below, can I assume there is an agreement for using 
processor model for filtering or chipset ID will be the preferred candidate.

Thanks
Anshul Makkar

-Original Message-
From: Tian, Kevin [mailto:kevin.t...@intel.com] 
Sent: 26 November 2015 07:17
To: Malcolm Crossley <malcolm.cross...@citrix.com>; Jan Beulich 
<jbeul...@suse.com>; Andrew Cooper <andrew.coop...@citrix.com>; Anshul Makkar 
<anshul.mak...@citrix.com>
Cc: Zhang, Yang Z <yang.z.zh...@intel.com>; xen-devel@lists.xen.org
Subject: RE: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for 
Sandybridge and earlier processors.

> From: Malcolm Crossley [mailto:malcolm.cross...@citrix.com]
> Sent: Wednesday, November 25, 2015 11:59 PM
> 
> On 25/11/15 15:38, Jan Beulich wrote:
> >>>> On 25.11.15 at 16:13, <andrew.coop...@citrix.com> wrote:
> >> On 25/11/15 10:49, Jan Beulich wrote:
> >>>>>> On 25.11.15 at 11:28, <andrew.coop...@citrix.com> wrote:
> >>>> On 24/11/15 17:41, Jan Beulich wrote:
> >>>>>>>> On 24.11.15 at 18:17,  wrote:
> >>>>>> --- a/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
> >>>>>>  /* Tylersburg interrupt remap quirk */
> >>>>>>  if ( iommu_intremap )
> >>>>>>  tylersburg_intremap_quirk();
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * Disable shared EPT ("sharept") on Sandybridge and older 
> >>>>>> processors
> >>>>>> + * by default.
> >>>>>> + * SandyBridge has no huge page support for IOTLB which 
> >>>>>> + leads to
> >> fallback
> >>>>>> + * on 4k pages and leads to performance degradation.
> >>>>>> + *
> >>>>>> + * Shared EPT ("sharept") will be disabled only if user has not
> >>>>>> + * provided explicit choice on the command line thus 
> >>>>>> + iommu_hap_pt_share
> >> is
> >>>>>> + * at its initialized value of -1.
> >>>>>> + */
> >>>>>> +if ( (boot_cpu_data.x86 == 0x06 && 
> >>>>>> + (boot_cpu_data.x86_model <= 0x2F
> ||
> >>>>>> +  boot_cpu_data.x86_model == 0x36)) && 
> >>>>>> + (iommu_hap_pt_share ==
> -1) )
> >>>>>> +iommu_hap_pt_share = 0;
> >>>>> If we really want to do this, then I think we should key this on 
> >>>>> EPT but not VT-d having 2M support, instead of on CPU models.
> >>>> This check is already performed by vtd_ept_page_compatible()
> >>> Yeah, I realized there would be such a check on the way home.
> >>>
> >>>> The problem is that SandyBridge IOMMUs advertise 2M support and 
> >>>> do function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>>
> >>>> As a result, attempting to use 2M translations causes 
> >>>> substantially worse performance than 4K translations.
> >>> So commit message and comment should make this more explicit, to 
> >>> avoid the impression "IOTLB" isn't just the relatively common 
> >>> mis-naming of "IOMMU".
> >>>
> >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
> >>>
> >>> Further the model based check is relatively broad, and includes 
> >>> Atoms (0x36 actually is one), which can't be considered 
> >>> "Sandybridge or older" imo.
> >>>
> >>> And finally I'm not fully convinced using CPU model info to deduce 
> >>> chipset behavior is entirely correct (albeit perhaps in practice 
> >>> it'll be fine except maybe when running Xen itself virtualized).
> >>
> >> What else would you suggest? I can't think of any better 
> >> identifying information.
> >
> > Chipset IDs / revisions?
> 
> In this case the IOMMU is integrated into the Sandybridge-EP processor itself.
> Unfortunately there's no register to query the IOTLB configuration of 
> the IOMMU and so we're stuck identifying the via the processor model number 
> itself.
> 
> Malcolm
> 

I'm OK to use processor model here, though ideally Jan is right. :-)

Thanks
Kevin

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-01 Thread Anshul Makkar
Snabbswitch (virtualized switch) also encountered similar problem : 
https://groups.google.com/forum/#!topic/snabb-devel/xX0yFzeXylI

Thanks
Anshul Makkar

-Original Message-
From: Andrew Cooper [mailto:andrew.coop...@citrix.com] 
Sent: 01 December 2015 10:34
To: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Cc: Jan Beulich <jbeul...@suse.com>; Kevin Tian <kevin.t...@intel.com>; 
yang.z.zh...@intel.com; Malcolm Crossley <malcolm.cross...@citrix.com>; Anshul 
Makkar <anshul.mak...@citrix.com>; xen-devel@lists.xen.org
Subject: Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for 
Sandybridge and earlier processors.

On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote:
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 11:28, <andrew.coop...@citrix.com> wrote:
>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do 
>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>
>>>>> As a result, attempting to use 2M translations causes 
>>>>> substantially worse performance than 4K translations.
>>>> Btw - how does this get explained? At a first glance, even if 2Mb 
>>>> translations don't get entered into the TLB, it should still be one 
>>>> less page table level to walk for the IOMMU, and should hence 
>>>> nevertheless be a benefit. Yet you even say _substantially_ worse 
>>>> performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K 
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be 
>>> performed every time there's a DMA access to that region of the BFN 
>>> address space.
>> Also remember that a high level dma access (from the point of view of 
>> a
>> driver) will be fragmented at the PCIe max packet size, which is 
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 
>> 16 pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, 
>> compared to using 4k mappings, from the point of view of a sample network 
>> workload.
> How did you observe this? I am mighty curious what kind of performance 
> tools you used to find this  as I would love to figure out if some of 
> the issues we have seen are related to this?

The 40% difference is just in terms of network throughput of a VF, given a 
workload which can normally saturate line rate on the card.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-24 Thread Anshul Makkar anshul.makkar
From: Anshul Makkar <anshul.mak...@citrix.com>

Sandybridge or earlier processors don't have huge page support for
IOTLB which leads to fallback on 4k pages and causes performance issues.

Shared EPT will be disabled only if the user has not provided explicit
choice on the command line.

Signed-off-by: Anshul Makkar <anshul.mak...@citrix.com>
---
v2:
   * Removed the use of extra variable to control the shared EPT and made
 the existent variable as tristate.
   * Narrowed down the check for processors to Sandybridge and older including
 Atom processors.

 docs/misc/xen-command-line.markdown  |  2 +-
 xen/drivers/passthrough/iommu.c  |  2 +-
 xen/drivers/passthrough/vtd/quirks.c | 14 ++
 xen/include/xen/iommu.h  |  2 +-
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index a2e427c..6b69ba2 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -896,7 +896,7 @@ debug hypervisor only).
 
 > `sharept`
 
-> Default: `true`
+> Default: `true` if newer than SandyBridge or `false` if Sandybridge or 
earlier.
 
 >> Control whether CPU and IOMMU page tables should be shared.
 
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index d5137733..9367987 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -51,7 +51,7 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
-bool_t __read_mostly iommu_hap_pt_share = 1;
+s8 __read_mostly iommu_hap_pt_share = -1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;
 
diff --git a/xen/drivers/passthrough/vtd/quirks.c 
b/xen/drivers/passthrough/vtd/quirks.c
index 143..7d63c8d 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
 /* Tylersburg interrupt remap quirk */
 if ( iommu_intremap )
 tylersburg_intremap_quirk();
+
+/*
+ * Disable shared EPT ("sharept") on Sandybridge and older processors
+ * by default.
+ * SandyBridge has no huge page support for IOTLB which leads to fallback
+ * on 4k pages and leads to performance degradation.
+ *
+ * Shared EPT ("sharept") will be disabled only if user has not
+ * provided explicit choice on the command line thus iommu_hap_pt_share is
+ * at its initialized value of -1.
+ */
+if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
+  boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
+iommu_hap_pt_share = 0;
 }
 
 /*
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8f3a20e..d52d06f 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -31,7 +31,7 @@ extern bool_t iommu_enable, iommu_enabled;
 extern bool_t force_iommu, iommu_verbose;
 extern bool_t iommu_workaround_bios_bug, iommu_igfx, iommu_passthrough;
 extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
-extern bool_t iommu_hap_pt_share;
+extern s8 iommu_hap_pt_share;
 extern bool_t iommu_debug;
 extern bool_t amd_iommu_perdev_intremap;
 
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel