Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-10 Thread Srivatsa Vaddagiri
On Thu, Dec 09, 2010 at 11:34:46PM -0500, Rik van Riel wrote:
 On 12/03/2010 09:06 AM, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:
 No, because they do receive service (they spend some time spinning
 before being interrupted), so the respective vruntimes will increase, at
 some point they'll pass B0 and it'll get scheduled.
 
 Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
 case)?
 
 I have a rough idea for a simpler way to ensure
 fairness.
 
 At yield_to time, we could track in the runqueue
 structure that a task received CPU time (and on
 the other runqueue that a task donated CPU time).
 
 The balancer can count time-given-to CPUs as
 busier, and donated-time CPUs as less busy,
 moving tasks away in the unlikely event that
 the same task keeps getting CPU time given to
 it.

I think just capping donation (either on send side or receive side) may be
simpler here than messing with load balancer logic.

 Conversely, it can move other tasks onto CPUs
 that have tasks on them that cannot make progress
 right now and are just donating their CPU time.
 
 Most of the time the time-given and time-received
 should balance out and there should be little to
 no influence on the load balancer. This code would
 just be there to deal with exceptional circumstances,
 to avoid the theoretical worst case people have
 described.

- vatsa


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-10 Thread Rik van Riel

On 12/10/2010 03:39 AM, Srivatsa Vaddagiri wrote:

On Thu, Dec 09, 2010 at 11:34:46PM -0500, Rik van Riel wrote:

On 12/03/2010 09:06 AM, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:

No, because they do receive service (they spend some time spinning
before being interrupted), so the respective vruntimes will increase, at
some point they'll pass B0 and it'll get scheduled.


Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
case)?


I have a rough idea for a simpler way to ensure
fairness.

At yield_to time, we could track in the runqueue
structure that a task received CPU time (and on
the other runqueue that a task donated CPU time).

The balancer can count time-given-to CPUs as
busier, and donated-time CPUs as less busy,
moving tasks away in the unlikely event that
the same task keeps getting CPU time given to
it.


I think just capping donation (either on send side or receive side) may be
simpler here than messing with load balancer logic.


Do you have any ideas on how to implement this in a simple
enough way that it may be acceptable upstream? :)

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-09 Thread Rik van Riel

On 12/03/2010 09:06 AM, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:

No, because they do receive service (they spend some time spinning
before being interrupted), so the respective vruntimes will increase, at
some point they'll pass B0 and it'll get scheduled.


Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
case)?


I have a rough idea for a simpler way to ensure
fairness.

At yield_to time, we could track in the runqueue
structure that a task received CPU time (and on
the other runqueue that a task donated CPU time).

The balancer can count time-given-to CPUs as
busier, and donated-time CPUs as less busy,
moving tasks away in the unlikely event that
the same task keeps getting CPU time given to
it.

Conversely, it can move other tasks onto CPUs
that have tasks on them that cannot make progress
right now and are just donating their CPU time.

Most of the time the time-given and time-received
should balance out and there should be little to
no influence on the load balancer. This code would
just be there to deal with exceptional circumstances,
to avoid the theoretical worst case people have
described.
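
A minimal sketch of the accounting described above (not from any posted patch;
the struct, field and helper names are made up for illustration):

struct rq_yield_stats {
        u64 time_received;      /* ns handed to tasks on this rq via yield_to() */
        u64 time_donated;       /* ns given away by tasks on this rq */
};

/*
 * Fold yield_to() traffic into the balancer's view of a cpu's load: count
 * time-given-to CPUs as busier and donated-time CPUs as less busy.  A real
 * implementation would decay these counters and convert them to load-weight
 * units before mixing them with the existing load figures.
 */
static u64 adjusted_cpu_load(u64 raw_load, struct rq_yield_stats *ys)
{
        s64 load = (s64)raw_load + ys->time_received - ys->time_donated;

        return load > 0 ? load : 0;
}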

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-08 Thread Rik van Riel

On 12/03/2010 08:23 AM, Peter Zijlstra wrote:

On Thu, 2010-12-02 at 14:44 -0500, Rik van Riel wrote:
unsigned long clone_flags);

+
+#ifdef CONFIG_SCHED_HRTICK
+extern u64 slice_remain(struct task_struct *);
+extern void yield_to(struct task_struct *);
+#else
+static inline void yield_to(struct task_struct *p) yield()
+#endif


What does SCHED_HRTICK have to do with any of this?


Legacy from an old prototype this patch is based on.
I'll get rid of that.


+/**
+ * requeue_task - requeue a task whose priority got changed by yield_to


priority doesn't seem the right word, you're not actually changing
anything related to p->*prio


True, I'll change the comment.


+ * @rq: the task's runqueue
+ * @p: the task in question
+ * Must be called with the runqueue lock held. Will cause the CPU to
+ * reschedule if p is now at the head of the runqueue.
+ */
+void requeue_task(struct rq *rq, struct task_struct *p)
+{
+   assert_spin_locked(&rq->lock);
+
+   if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p))
+           return;
+
+   dequeue_task(rq, p, 0);
+   enqueue_task(rq, p, 0);
+
+   resched_task(p);


I guess that wants to be something like check_preempt_curr()


Done.  Thanks for pointing that out.


@@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned 
int, len,
return ret;
  }

+#ifdef CONFIG_SCHED_HRTICK


Still wondering what all this has to do with SCHED_HRTICK..


+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when p holds a resource other threads are waiting for.
+ * Giving priority to p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct rq *rq;
+   struct cfs_rq *cfs_rq;
+   u64 remain = slice_remain(current);
+
+   rq = task_rq_lock(p, &flags);
+   if (task_running(rq, p) || task_has_rt_policy(p))
+           goto out;


See, this all ain't nice, slice_remain() don't make no sense to be
called for !fair tasks.

Why not write:

   if (curr->sched_class == p->sched_class &&
       curr->sched_class->yield_to)
           curr->sched_class->yield_to(curr, p);

or something, and then implement sched_class_fair::yield_to only,
leaving it a NOP for all other classes.


Done.


+   cfs_rq = cfs_rq_of(se);
+   se->vruntime -= remain;
+   if (se->vruntime < cfs_rq->min_vruntime)
+           se->vruntime = cfs_rq->min_vruntime;


Now here we have another problem, remain was measured in wall-time, and
then you go change a virtual time measure using that. These things are
related like:

  vt = t/weight

So you're missing a weight factor somewhere.

Also, that check against min_vruntime doesn't really make much sense.


OK, how do I do this?


+   requeue_task(rq, p);


Just makes me wonder why you added requeue task to begin with.. why not
simply dequeue at the top of this function, and enqueue at the tail,
like all the rest does: see rt_mutex_setprio(), set_user_nice(),
sched_move_task().


Done.


+ out:
+   task_rq_unlock(rq, &flags);
+   yield();
+}
+EXPORT_SYMBOL(yield_to);


EXPORT_SYMBOL_GPL() pretty please, I really hate how kvm is a module and
needs to export hooks all over the core kernel :/


Done.


Right, so another approach might be to simply swap the vruntime between
curr and p.


Doesn't that run into the same scale issue you described
above?

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-08 Thread Peter Zijlstra
On Wed, 2010-12-08 at 12:55 -0500, Rik van Riel wrote:
 
 
  Right, so another approach might be to simply swap the vruntime between
  curr and p.
 
 Doesn't that run into the same scale issue you described
 above?

Not really, but it's tricky on SMP because vruntime only has meaning
within a cfs_rq.

The below is quickly cobbled together from a few patches I had laying
about dealing with similar issues.

The avg_vruntime() stuff takes the weighted average of the vruntimes on
a cfs runqueue, this weighted average corresponds to the 0-lag point,
that is the point where an ideal scheduler would have all tasks.

Using the 0-lag point you can compute the lag of a task; the lag is a
measure of service for a task, negative lag means the task is owed
service, positive lag means it's got too much service (at least, that's
the sign convention used here, I always forget what the standard is).

What we do is, when @p, the target task, is owed less service than
current, we flip the lags such that p will become more eligible.
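
For reference, the quantities involved, using the sign convention above, are
roughly:

  avg_vruntime = sum_i(w_i * v_i) / sum_i(w_i)    (the 0-lag point)
  lag_i        = v_i - avg_vruntime

so a task that has received less than its weighted fair share sits left of the
0-lag point and has negative lag.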

The trouble with all this is that computing the weighted average is
terribly expensive (it increases cost of all scheduler hot paths).

The dyn_vruntime stuff mixed in there is an attempt at computing
something similar, although it's not used and I haven't tested the
quality of the approximation in a while.

Anyway, completely untested and such..

---
 include/linux/sched.h   |2 +
 kernel/sched.c  |   39 ++
 kernel/sched_debug.c|   31 -
 kernel/sched_fair.c |  179 ++-
 kernel/sched_features.h |8 +--
 5 files changed, 203 insertions(+), 56 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b0fc8ee..538559e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1095,6 +1095,8 @@ struct sched_class {
 #ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
+
+   void (*yield_to) (struct task_struct *p);
 };
 
 struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index 0debad9..fe0adb0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -313,6 +313,9 @@ struct cfs_rq {
struct load_weight load;
unsigned long nr_running;
 
+   s64 avg_vruntime;
+   u64 avg_load;
+
u64 exec_clock;
u64 min_vruntime;
 
@@ -5062,6 +5065,42 @@ SYSCALL_DEFINE0(sched_yield)
return 0;
 }
 
+void yield_to(struct task_struct *p)
+{
+   struct task_struct *curr = current;
+   struct rq *p_rq, *rq;
+   unsigned long flags;
+   int on_rq;
+
+   local_irq_save(flags);
+   rq = this_rq();
+again:
+   p_rq = task_rq(p);
+   double_rq_lock(rq, p_rq);
+   if (p_rq != task_rq(p)) {
+   double_rq_unlock(rq, p_rq);
+   goto again;
+   }
+
+   update_rq_clock(rq);
+   update_rq_clock(p_rq);
+
+   on_rq = p->se.on_rq;
+   if (on_rq)
+           dequeue_task(p_rq, p, 0);
+
+   ret = 0;
+   if (p->sched_class == curr->sched_class && curr->sched_class->yield_to)
+           curr->sched_class->yield_to(p);
+
+   if (on_rq)
+           enqueue_task(p_rq, p, 0);
+
+   double_rq_unlock(rq, p_rq);
+   local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 static inline int should_resched(void)
 {
return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 1dfae3d..b5cc773 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -138,10 +138,9 @@ static void print_rq(struct seq_file *m, struct rq *rq, 
int rq_cpu)
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-   s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
-   spread, rq0_min_vruntime, spread0;
+   s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
struct rq *rq = cpu_rq(cpu);
-   struct sched_entity *last;
+   struct sched_entity *last, *first;
unsigned long flags;
 
SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
@@ -149,28 +148,26 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct 
cfs_rq *cfs_rq)
SPLIT_NS(cfs_rq->exec_clock));
 
raw_spin_lock_irqsave(&rq->lock, flags);
-   if (cfs_rq->rb_leftmost)
-   MIN_vruntime = (__pick_next_entity(cfs_rq))->vruntime;
+   first = __pick_first_entity(cfs_rq);
+   if (first)
+   left_vruntime = first->vruntime;
last = __pick_last_entity(cfs_rq);
if (last)
-   max_vruntime = last->vruntime;
+   right_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
-   rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
raw_spin_unlock_irqrestore(&rq->lock, flags);
-   SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
-   SPLIT_NS(MIN_vruntime));
+   SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
+   

Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-08 Thread Peter Zijlstra
On Wed, 2010-12-08 at 21:00 +0100, Peter Zijlstra wrote:
 +   lag0 = avg_vruntime(cfs_rq_of(se));
 +   p_lag0 = avg_vruntime(cfs_rq_of(p_se));
 +
 +   lag = se->vruntime - avg_vruntime(cfs_rq);
 +   p_lag = p_se->vruntime - avg_vruntime(p_cfs_rq);
 +
 +   if (p_lag > lag) { /* if P is owed less service */
 +   se->vruntime = lag0 + p_lag;
 +   p_se->vruntime = p_lag + lag;
 +   } 

clearly that should read:

  p_se->vruntime = p_lag0 + lag;


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-08 Thread Rik van Riel

On 12/08/2010 03:00 PM, Peter Zijlstra wrote:


Anyway, completely untested and such..


Looks very promising.  I've been making a few changes in the same
direction (except for the fancy CFS bits) and have one way to solve
the one problem you pointed out in your patch.


+void yield_to(struct task_struct *p)
+{

...

+   on_rq = p->se.on_rq;
+   if (on_rq)
+           dequeue_task(p_rq, p, 0);
+
+   ret = 0;
+   if (p->sched_class == curr->sched_class && curr->sched_class->yield_to)
+           curr->sched_class->yield_to(p);
+
+   if (on_rq)
+           enqueue_task(p_rq, p, 0);



diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c886717..8689bcd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c



+static void yield_to_fair(struct task_struct *p)
+{
+   struct sched_entity *se = &current->se;
+   struct sched_entity *p_se = &p->se;
+   u64 lag0, p_lag0;
+   s64 lag, p_lag;
+
+   lag0 = avg_vruntime(cfs_rq_of(se));
+   p_lag0 = avg_vruntime(cfs_rq_of(p_se));
+
+   lag = se->vruntime - avg_vruntime(cfs_rq);
+   p_lag = p_se->vruntime - avg_vruntime(p_cfs_rq);
+
+   if (p_lag > lag) { /* if P is owed less service */
+           se->vruntime = lag0 + p_lag;
+           p_se->vruntime = p_lag + lag;
+   }
+
+   /*
+* XXX try something smarter here
+*/
+   resched_task(p);
+   resched_task(current);
+}


If we do the dequeue_task and enqueue_task here, we can use
check_preempt_curr in yield_to_fair.

Alternatively, we can do the rescheduling from the main
yield_to function, not from yield_to_fair, by calling
check_preempt_curr on p and current after p has been
enqueued.

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-04 Thread Rik van Riel

On 12/03/2010 04:23 PM, Peter Zijlstra wrote:

On Fri, 2010-12-03 at 19:40 +0530, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 07:36:07PM +0530, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:

No, because they do receive service (they spend some time spinning
before being interrupted), so the respective vruntimes will increase, at
some point they'll pass B0 and it'll get scheduled.


Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
case)?


Hmm perhaps yes, although at the cost of tons of context switches, which would be
nice to minimize?


Don't care, as long as the guys calling yield_to() pay for that time its
their problem.


Also, the context switches are cheaper than spinning
entire time slices on spinlocks we're not going to get
(because the VCPU holding the lock is not running).

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Thu, 2010-12-02 at 14:44 -0500, Rik van Riel wrote:
unsigned long clone_flags);
 +
 +#ifdef CONFIG_SCHED_HRTICK
 +extern u64 slice_remain(struct task_struct *);
 +extern void yield_to(struct task_struct *);
 +#else
 +static inline void yield_to(struct task_struct *p) yield()
 +#endif

What does SCHED_HRTICK have to do with any of this?

  #ifdef CONFIG_SMP
   extern void kick_process(struct task_struct *tsk);
  #else
 diff --git a/kernel/sched.c b/kernel/sched.c
 index f8e5a25..ef088cd 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -1909,6 +1909,26 @@ static void dequeue_task(struct rq *rq, struct 
 task_struct *p, int sleep)
  p->se.on_rq = 0;
  }
  
 +/**
 + * requeue_task - requeue a task whose priority got changed by yield_to

priority doesn't seem the right word, you're not actually changing
anything related to p->*prio

 + * @rq: the task's runqueue
 + * @p: the task in question
 + * Must be called with the runqueue lock held. Will cause the CPU to
 + * reschedule if p is now at the head of the runqueue.
 + */
 +void requeue_task(struct rq *rq, struct task_struct *p)
 +{
 + assert_spin_locked(&rq->lock);
 +
 + if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p))
 +         return;
 +
 + dequeue_task(rq, p, 0);
 + enqueue_task(rq, p, 0);
 +
 + resched_task(p);

I guess that wants to be something like check_preempt_curr()

 +}
 +
  /*
   * __normal_prio - return the priority that is based on the static prio
   */
 @@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, 
 unsigned int, len,
   return ret;
  }
  
 +#ifdef CONFIG_SCHED_HRTICK

Still wondering what all this has to do with SCHED_HRTICK..

 +/*
 + * Yield the CPU, giving the remainder of our time slice to task p.
 + * Typically used to hand CPU time to another thread inside the same
 + * process, eg. when p holds a resource other threads are waiting for.
 + * Giving priority to p may help get that resource released sooner.
 + */
 +void yield_to(struct task_struct *p)
 +{
 + unsigned long flags;
 + struct sched_entity *se = &p->se;
 + struct rq *rq;
 + struct cfs_rq *cfs_rq;
 + u64 remain = slice_remain(current);
 +
 + rq = task_rq_lock(p, &flags);
 + if (task_running(rq, p) || task_has_rt_policy(p))
 + goto out;

See, this all ain't nice, slice_remain() don't make no sense to be
called for !fair tasks.

Why not write:

  if (curr->sched_class == p->sched_class &&
      curr->sched_class->yield_to)
          curr->sched_class->yield_to(curr, p);

or something, and then implement sched_class_fair::yield_to only,
leaving it a NOP for all other classes.


Also, I think you can side-step that whole curr vs p rq->lock thing
you're doing here, by holding p's rq->lock, you've disabled IRQs in
current's task context, since ->sum_exec_runtime and all are only
changed during scheduling and the scheduler_tick, disabling IRQs in its
task context pins them.

 + cfs_rq = cfs_rq_of(se);
 + se->vruntime -= remain;
 + if (se->vruntime < cfs_rq->min_vruntime)
 +         se->vruntime = cfs_rq->min_vruntime;

Now here we have another problem, remain was measured in wall-time, and
then you go change a virtual time measure using that. These things are
related like:

 vt = t/weight

So you're missing a weight factor somewhere.
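
A minimal sketch of that scaling (illustrative names only; it assumes the usual
convention that a nice-0 task has load weight 1024 and advances its vruntime at
wall-clock rate):

#include <stdint.h>

#define NICE_0_LOAD 1024        /* load weight of a nice-0 task */

/*
 * Convert a wall-clock execution delta (ns) into a vruntime delta for an
 * entity with the given load weight: vt = t / weight, normalized so that
 * a nice-0 task's vruntime advances at wall-clock rate.
 */
static inline uint64_t walltime_to_vruntime(uint64_t delta_exec_ns,
                                            unsigned long weight)
{
        return delta_exec_ns * NICE_0_LOAD / weight;
}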

Also, that check against min_vruntime doesn't really make much sense.


 + requeue_task(rq, p);

Just makes me wonder why you added requeue task to begin with.. why not
simply dequeue at the top of this function, and enqueue at the tail,
like all the rest does: see rt_mutex_setprio(), set_user_nice(),
sched_move_task().

 + out:
 + task_rq_unlock(rq, &flags);
 + yield();
 +}
 +EXPORT_SYMBOL(yield_to);

EXPORT_SYMBOL_GPL() pretty please, I really hate how kvm is a module and
needs to export hooks all over the core kernel :/

 +#endif
 +
  /**
   * sys_sched_yield - yield the current processor to other threads.
   *
 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index 5119b08..2a0a595 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity 
 *curr, int queued)
   */
  
  #ifdef CONFIG_SCHED_HRTICK
 +u64 slice_remain(struct task_struct *p)
 +{
 + unsigned long flags;
 + struct sched_entity *se = &p->se;
 + struct cfs_rq *cfs_rq;
 + struct rq *rq;
 + u64 slice, ran;
 + s64 delta;
 +
 + rq = task_rq_lock(p, &flags);
 + cfs_rq = cfs_rq_of(se);
 + slice = sched_slice(cfs_rq, se);
 + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
 + delta = slice - ran;
 + task_rq_unlock(rq, &flags);
 +
 + return max(delta, 0LL);
 +}


Right, so another approach might be to simply swap the vruntime between
curr and p.


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 02:23:39PM +0100, Peter Zijlstra wrote:
 Right, so another approach might be to simply swap the vruntime between
 curr and p.

Can't that cause others to starve? For ex: consider a cpu p0 having these tasks:

p0 -> A0 B0 A1

A0/A1 have entered some sort of AB-BA spin-deadlock, as a result A0 wants to
direct yield to A1 which wants to direct yield to A0. If we keep swapping their
runtimes, can't it starve B0?

- vatsa




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 06:54:16AM +0100, Mike Galbraith wrote:
  +void yield_to(struct task_struct *p)
  +{
  +   unsigned long flags;
   +   struct sched_entity *se = &p->se;
  +   struct rq *rq;
  +   struct cfs_rq *cfs_rq;
  +   u64 remain = slice_remain(current);
 
 That slice remaining only shows the distance to last preempt, however
 brief.  It shows nothing wrt tree position, the yielding task may well
 already be right of the task it wants to yield to, having been a buddy.

Good point.

  cfs_rq = cfs_rq_of(se);
  +   se->vruntime -= remain;
  +   if (se->vruntime < cfs_rq->min_vruntime)
  +           se->vruntime = cfs_rq->min_vruntime;
 
 (This is usually done using max_vruntime())
 
 If the recipient was already left of the fair stick (min_vruntime),
 clipping advances it's vruntime, vaporizing entitlement from both donor
 and recipient.
 
 What if a task tries to yield to another not on the same cpu, and/or in
 the same task group?

In this case, the target of yield_to is a vcpu belonging to the same VM and hence is
expected to be in the same task group, but I agree it's good to put a check.

  This would munge min_vruntime of other queues.  I
 think you'd have to restrict this to same cpu, same group.  If tasks can
 donate cross cfs_rq, (say) pinned task A cpu A running solo could donate
 vruntime to selected tasks pinned to cpu B, for as long as minuscule
 preemptions can resupply ammo.  Would suck to not be the favored child.

IOW starving non-favored children?

 Maybe you could exchange vruntimes cooperatively (iff same cfs_rq)
 between threads, but I don't think donations with clipping works.

Can't that lead to starvation again (as I pointed in a mail to Peterz):

p0 -> A0 B0 A1

A0/A1 enter a yield_to(other) deadlock, which means we keep swapping their
vruntimes, starving B0?

- vatsa


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Fri, 2010-12-03 at 19:00 +0530, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 02:23:39PM +0100, Peter Zijlstra wrote:
  Right, so another approach might be to simply swap the vruntime between
  curr and p.
 
 Can't that cause others to starve? For ex: consider a cpu p0 having these
 tasks:
 
 p0 -> A0 B0 A1
 
 A0/A1 have entered some sort of AB-BA spin-deadlock, as a result A0 wants to
 direct yield to A1 which wants to direct yield to A0. If we keep swapping 
 their
 runtimes, can't it starve B0?

No, because they do receive service (they spend some time spinning
before being interrupted), so the respective vruntimes will increase, at
some point they'll pass B0 and it'll get scheduled.




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:
 No, because they do receive service (they spend some time spinning
 before being interrupted), so the respective vruntimes will increase, at
 some point they'll pass B0 and it'll get scheduled.

Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
case)?

- vatsa


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 07:36:07PM +0530, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:
  No, because they do receive service (they spend some time spinning
  before being interrupted), so the respective vruntimes will increase, at
  some point they'll pass B0 and it'll get scheduled.
 
 Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
 case)?

Hmm perhaps yes, although at the cost of tons of context switches, which would be
nice to minimize?

- vatsa


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Mike Galbraith
On Fri, 2010-12-03 at 19:16 +0530, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 06:54:16AM +0100, Mike Galbraith wrote:
   +void yield_to(struct task_struct *p)
   +{
   + unsigned long flags;
    + struct sched_entity *se = &p->se;
   + struct rq *rq;
   + struct cfs_rq *cfs_rq;
   + u64 remain = slice_remain(current);
  
  That slice remaining only shows the distance to last preempt, however
  brief.  It shows nothing wrt tree position, the yielding task may well
  already be right of the task it wants to yield to, having been a buddy.
 
 Good point.
 
   cfs_rq = cfs_rq_of(se);
    + se->vruntime -= remain;
    + if (se->vruntime < cfs_rq->min_vruntime)
    +         se->vruntime = cfs_rq->min_vruntime;
  
  (This is usually done using max_vruntime())
  
  If the recipient was already left of the fair stick (min_vruntime),
  clipping advances it's vruntime, vaporizing entitlement from both donor
  and recipient.
  
  What if a task tries to yield to another not on the same cpu, and/or in
  the same task group?
 
 In this case, target of yield_to is a vcpu belonging to the same VM and hence 
 is
 expected to be in same task group, but I agree its good to put a check.
 
   This would munge min_vruntime of other queues.  I
  think you'd have to restrict this to same cpu, same group.  If tasks can
  donate cross cfs_rq, (say) pinned task A cpu A running solo could donate
  vruntime to selected tasks pinned to cpu B, for as long as minuscule
  preemptions can resupply ammo.  Would suck to not be the favored child.
 
 IOW starving non-favored children?

Yes, as in fairness ceases to exist.

  Maybe you could exchange vruntimes cooperatively (iff same cfs_rq)
  between threads, but I don't think donations with clipping works.
 
 Can't that lead to starvation again (as I pointed in a mail to Peterz):
 
 p0 -> A0 B0 A1
 
 A0/A1 enter a yield_to(other) deadlock, which means we keep swapping their
 vruntimes, starving B0?

I'll have to go back and re-read that.  Off the top of my head, I see no
way it could matter which container the numbers live in as long as they
keep advancing, and stay in the same runqueue.  (hm, task weights would
have to be the same too or scaled. dangerous business, tinkering with
vruntimes)

-Mike



Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/03/2010 09:45 AM, Mike Galbraith wrote:


I'll have to go back and re-read that.  Off the top of my head, I see no
way it could matter which container the numbers live in as long as they
keep advancing, and stay in the same runqueue.  (hm, task weights would
have to be the same too or scaled. dangerous business, tinkering with
vruntimes)


They're not necessarily in the same runqueue, the
VCPU that is given time might be on another CPU
than the one that was spinning on a lock.

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Mike Galbraith
On Fri, 2010-12-03 at 09:48 -0500, Rik van Riel wrote:
 On 12/03/2010 09:45 AM, Mike Galbraith wrote:
 
  I'll have to go back and re-read that.  Off the top of my head, I see no
  way it could matter which container the numbers live in as long as they
  keep advancing, and stay in the same runqueue.  (hm, task weights would
  have to be the same too or scaled. dangerous business, tinkering with
  vruntimes)
 
 They're not necessarily in the same runqueue, the
 VCPU that is given time might be on another CPU
 than the one that was spinning on a lock.

I don't think pumping vruntime cross cfs_rq would be safe, for the
reason noted (et al).  No competition means vruntime is meaningless.
Donating just advances a clock that nobody's looking at.

-Mike



Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/03/2010 10:09 AM, Mike Galbraith wrote:

On Fri, 2010-12-03 at 09:48 -0500, Rik van Riel wrote:

On 12/03/2010 09:45 AM, Mike Galbraith wrote:


I'll have to go back and re-read that.  Off the top of my head, I see no
way it could matter which container the numbers live in as long as they
keep advancing, and stay in the same runqueue.  (hm, task weights would
have to be the same too or scaled. dangerous business, tinkering with
vruntimes)


They're not necessarily in the same runqueue, the
VCPU that is given time might be on another CPU
than the one that was spinning on a lock.


I don't think pumping vruntime cross cfs_rq would be safe, for the
reason noted (et al).  No competition means vruntime is meaningless.
Donating just advances a clock that nobody's looking at.


Do you have suggestions on what I should do to make
this yield_to functionality work?

I'm willing to implement pretty much anything the
scheduler people will be happy with :)

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 10:35:25AM -0500, Rik van Riel wrote:
 Do you have suggestions on what I should do to make
 this yield_to functionality work?

Keeping in mind the complications of yield_to, I had suggested we do something
along the lines described below:

http://marc.info/?l=kvmm=129122645006996w=2

Essentially yield to other tasks on your own runqueue and, when you get to run
again, try reclaiming what you gave up earlier (with a cap on how much of this
relinquished time one can feed back). It can be accomplished via a special form of
yield(), available only to in-kernel users, kvm in this case.
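
A rough sketch of that idea (names and the cap value are invented for
illustration; this is not code from any posted patch):

#include <stdint.h>

#define RECLAIM_CAP_NS  (4ULL * 1000 * 1000)    /* arbitrary 4ms cap on feedback */

struct vcpu_yield_state {
        uint64_t relinquished_ns;       /* slice time given up so far */
};

/* in-kernel yield: give up the rest of the slice to tasks on this runqueue */
static void kvm_relinquish(struct vcpu_yield_state *vs, uint64_t slice_remain_ns)
{
        vs->relinquished_ns += slice_remain_ns;
        /* ...then yield to the other tasks on the local runqueue... */
}

/* when the vcpu runs again, feed back a capped amount of what it gave up */
static uint64_t kvm_reclaim_credit(struct vcpu_yield_state *vs)
{
        uint64_t credit = vs->relinquished_ns;

        if (credit > RECLAIM_CAP_NS)
                credit = RECLAIM_CAP_NS;
        vs->relinquished_ns -= credit;
        return credit;
}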

- vatsa 


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/03/2010 11:20 AM, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 10:35:25AM -0500, Rik van Riel wrote:

Do you have suggestions on what I should do to make
this yield_to functionality work?


Keeping in mind the complications of yield_to, I had suggested we do something
along the lines described below:

http://marc.info/?l=kvmm=129122645006996w=2

Essentially yield to other tasks on your own runqueue and, when you get to run
again, try reclaiming what you gave up earlier (with a cap on how much of this
relinquished time one can feed back). It can be accomplished via a special form of
yield(), available only to in-kernel users, kvm in this case.


I don't see how that is going to help get the lock
released, when the VCPU holding the lock is on another
CPU.

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 12:09:01PM -0500, Rik van Riel wrote:
 I don't see how that is going to help get the lock
 released, when the VCPU holding the lock is on another
 CPU.

Even the directed yield() is not guaranteed to get the lock released, given it's
shooting in the dark?

Anyway, the intention of the yield() proposed was not to get the lock released
immediately (which will happen eventually), but rather to avoid the inefficiency
associated with (long) spinning and at the same time make sure we are not
leaking our bandwidth to other guests because of a naive yield..

- vatsa




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/03/2010 12:29 PM, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 12:09:01PM -0500, Rik van Riel wrote:

I don't see how that is going to help get the lock
released, when the VCPU holding the lock is on another
CPU.


Even the directed yield() is not guaranteed to get the lock released, given its
shooting in the dark?


True, that's a fair point.


Anyway, the intention of yield() proposed was not to get lock released
immediately (which will happen eventually), but rather to avoid inefficiency
associated with (long) spinning and at the same time make sure we are not
leaking our bandwidth to other guests because of a naive yield ..


A KVM guest can run on the host alongside short-lived
processes, though.  How can we ensure that a VCPU that
donates time gets it back again later, when the task
the time was donated to may no longer exist?

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 12:33:29PM -0500, Rik van Riel wrote:
 Anyway, the intention of yield() proposed was not to get lock released
 immediately (which will happen eventually), but rather to avoid inefficiency
 associated with (long) spinning and at the same time make sure we are not
 leaking our bandwidth to other guests because of a naive yield ..
 
 A KVM guest can run on the host alongside short-lived
 processes, though.  How can we ensure that a VCPU that
 donates time gets it back again later, when the task
 time was donated to may no longer exist?

I think that does not matter. What matters for fairness in this case is how much
cpu time the yielding thread gets over some (larger) time window. By ensuring that
relinquished time is fed back, we should maintain fairness for that particular
vcpu thread.. This also avoids nasty interactions associated with donation..

- vatsa


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/02/2010 07:50 PM, Chris Wright wrote:


+void requeue_task(struct rq *rq, struct task_struct *p)
+{
+   assert_spin_locked(&rq->lock);
+
+   if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p))
+           return;


already checked task_running(rq, p) || task_has_rt_policy(p) w/ rq lock
held.


OK, I removed the duplicate checks.


+
+   dequeue_task(rq, p, 0);
+   enqueue_task(rq, p, 0);


seems like you could condense to save an update_rq_clock() call at least,
don't know if the info_queued, info_dequeued need to be updated


Or I can do the whole operation with the task not queued.
Not sure yet what approach I'll take...


+#ifdef CONFIG_SCHED_HRTICK
+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when p holds a resource other threads are waiting for.
+ * Giving priority to p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct rq *rq;
+   struct cfs_rq *cfs_rq;
+   u64 remain = slice_remain(current);
+
+   rq = task_rq_lock(p, &flags);
+   if (task_running(rq, p) || task_has_rt_policy(p))
+           goto out;
+   cfs_rq = cfs_rq_of(se);
+   se->vruntime -= remain;
+   if (se->vruntime < cfs_rq->min_vruntime)
+           se->vruntime = cfs_rq->min_vruntime;


Should these details all be in sched_fair?  Seems like the wrong layer
here.  And would that condition go the other way?  If new vruntime is
smaller than min, then it becomes new cfs_rq->min_vruntime?


That would be nice.  Unfortunately, EXPORT_SYMBOL() does
not seem to work right from sched_fair.c, which is included
from sched.c instead of being built from the makefile!


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5119b08..2a0a595 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity 
*curr, int queued)
   */

  #ifdef CONFIG_SCHED_HRTICK
+u64 slice_remain(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct cfs_rq *cfs_rq;
+   struct rq *rq;
+   u64 slice, ran;
+   s64 delta;
+
+   rq = task_rq_lock(p, &flags);
+   cfs_rq = cfs_rq_of(se);
+   slice = sched_slice(cfs_rq, se);
+   ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+   delta = slice - ran;
+   task_rq_unlock(rq, &flags);
+
+   return max(delta, 0LL);


Can delta go negative?


Good question.  I don't know.

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Chris Wright
* Rik van Riel (r...@redhat.com) wrote:
 On 12/02/2010 07:50 PM, Chris Wright wrote:
 +/*
 + * Yield the CPU, giving the remainder of our time slice to task p.
 + * Typically used to hand CPU time to another thread inside the same
 + * process, eg. when p holds a resource other threads are waiting for.
 + * Giving priority to p may help get that resource released sooner.
 + */
 +void yield_to(struct task_struct *p)
 +{
 +   unsigned long flags;
 +   struct sched_entity *se = &p->se;
 +   struct rq *rq;
 +   struct cfs_rq *cfs_rq;
 +   u64 remain = slice_remain(current);
 +
 +   rq = task_rq_lock(p, &flags);
 +   if (task_running(rq, p) || task_has_rt_policy(p))
 +           goto out;
 +   cfs_rq = cfs_rq_of(se);
 +   se->vruntime -= remain;
 +   if (se->vruntime < cfs_rq->min_vruntime)
 +           se->vruntime = cfs_rq->min_vruntime;
 
 Should these details all be in sched_fair?  Seems like the wrong layer
 here.  And would that condition go the other way?  If new vruntime is
 smaller than min, then it becomes new cfs_rq->min_vruntime?
 
 That would be nice.  Unfortunately, EXPORT_SYMBOL() does
 not seem to work right from sched_fair.c, which is included
 from sched.c instead of being built from the makefile!

add a ->yield_to() to properly isolate (only relevant then in
sched_fair)?


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Mike Galbraith
On Fri, 2010-12-03 at 10:35 -0500, Rik van Riel wrote:
 On 12/03/2010 10:09 AM, Mike Galbraith wrote:
  On Fri, 2010-12-03 at 09:48 -0500, Rik van Riel wrote:
  On 12/03/2010 09:45 AM, Mike Galbraith wrote:
 
  I'll have to go back and re-read that.  Off the top of my head, I see no
  way it could matter which container the numbers live in as long as they
  keep advancing, and stay in the same runqueue.  (hm, task weights would
  have to be the same too or scaled. dangerous business, tinkering with
  vruntimes)
 
  They're not necessarily in the same runqueue, the
  VCPU that is given time might be on another CPU
  than the one that was spinning on a lock.
 
  I don't think pumping vruntime cross cfs_rq would be safe, for the
  reason noted (et al).  No competition means vruntime is meaningless.
  Donating just advances a clock that nobody's looking at.
 
 Do you have suggestions on what I should do to make
 this yield_to functionality work?

Hm.

The problem with donating vruntime across queues is that there is no
global clock.  You have to be in the same frame of reference for
vruntime donation to make any sense.  Same with cross cpu yield_to() hw
wise though.  It makes no sense from another frame of reference.  Pull.

-Mike



Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Fri, 2010-12-03 at 16:09 +0100, Mike Galbraith wrote:
 On Fri, 2010-12-03 at 09:48 -0500, Rik van Riel wrote:
  On 12/03/2010 09:45 AM, Mike Galbraith wrote:
  
   I'll have to go back and re-read that.  Off the top of my head, I see no
   way it could matter which container the numbers live in as long as they
   keep advancing, and stay in the same runqueue.  (hm, task weights would
   have to be the same too or scaled. dangerous business, tinkering with
   vruntimes)
  
  They're not necessarily in the same runqueue, the
  VCPU that is given time might be on another CPU
  than the one that was spinning on a lock.
 
 I don't think pumping vruntime cross cfs_rq would be safe, for the
 reason noted (et al).  No competition means vruntime is meaningless.
 Donating just advances a clock that nobody's looking at.

Yeah, cross-cpu you have to model it like exchanging lag. That's a
slightly more complicated trick (esp. since we still don't have a proper
measure for lag) but it should be doable.


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Fri, 2010-12-03 at 19:40 +0530, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 07:36:07PM +0530, Srivatsa Vaddagiri wrote:
  On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:
   No, because they do receive service (they spend some time spinning
   before being interrupted), so the respective vruntimes will increase, at
   some point they'll pass B0 and it'll get scheduled.
  
  Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in 
  this
  case)?
 
 Hmm perhaps yes, although at the cost of tons of context switches, which would be
 nice to minimize?

Don't care, as long as the guys calling yield_to() pay for that time it's
their problem.


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Fri, 2010-12-03 at 13:27 -0500, Rik van Riel wrote:
  Should these details all be in sched_fair?  Seems like the wrong layer
  here.  And would that condition go the other way?  If new vruntime is
  smaller than min, then it becomes new cfs_rq->min_vruntime?
 
 That would be nice.  Unfortunately, EXPORT_SYMBOL() does
 not seem to work right from sched_fair.c, which is included
 from sched.c instead of being built from the makefile!

I'm not quite sure why that is, but I kinda like that, the policy
implementation should never export stuff.

Code outside the scheduler cannot ever know the policy of a task, hence
policy specific exports are bad.

A generic export with policy implementations (like the
sched_class::yield_to() proposal) are the proper way.


[RFC PATCH 2/3] sched: add yield_to function

2010-12-02 Thread Rik van Riel
Add a yield_to function to the scheduler code, allowing us to
give the remainder of our timeslice to another thread.

We may want to use this to provide a sys_yield_to system call
one day.
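
As a usage illustration only (the lock-holder lookup is hypothetical and not
part of this series), a vcpu thread that notices it is spinning on a lock held
by another vcpu could hand over its remaining slice like this:

        struct task_struct *holder = lock_holder_vcpu_task;    /* hypothetical lookup */

        if (holder && holder != current)
                yield_to(holder);
        else
                yield();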

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5f926c..4f3cce9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1985,6 +1985,7 @@ extern void set_user_nice(struct task_struct *p, long 
nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
 extern int can_nice(const struct task_struct *p, const int nice);
+extern void requeue_task(struct rq *rq, struct task_struct *p);
 extern int task_curr(const struct task_struct *p);
 extern int idle_cpu(int cpu);
 extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
@@ -2058,6 +2059,14 @@ extern int wake_up_state(struct task_struct *tsk, 
unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
unsigned long clone_flags);
+
+#ifdef CONFIG_SCHED_HRTICK
+extern u64 slice_remain(struct task_struct *);
+extern void yield_to(struct task_struct *);
+#else
+static inline void yield_to(struct task_struct *p) yield()
+#endif
+
 #ifdef CONFIG_SMP
  extern void kick_process(struct task_struct *tsk);
 #else
diff --git a/kernel/sched.c b/kernel/sched.c
index f8e5a25..ef088cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1909,6 +1909,26 @@ static void dequeue_task(struct rq *rq, struct 
task_struct *p, int sleep)
p->se.on_rq = 0;
 }
 
+/**
+ * requeue_task - requeue a task whose priority got changed by yield_to
+ * @rq: the task's runqueue
+ * @p: the task in question
+ * Must be called with the runqueue lock held. Will cause the CPU to
+ * reschedule if p is now at the head of the runqueue.
+ */
+void requeue_task(struct rq *rq, struct task_struct *p)
+{
+   assert_spin_locked(&rq->lock);
+
+   if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p))
+           return;
+
+   dequeue_task(rq, p, 0);
+   enqueue_task(rq, p, 0);
+
+   resched_task(p);
+}
+
 /*
  * __normal_prio - return the priority that is based on the static prio
  */
@@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned 
int, len,
return ret;
 }
 
+#ifdef CONFIG_SCHED_HRTICK
+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when p holds a resource other threads are waiting for.
+ * Giving priority to p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct rq *rq;
+   struct cfs_rq *cfs_rq;
+   u64 remain = slice_remain(current);
+
+   rq = task_rq_lock(p, &flags);
+   if (task_running(rq, p) || task_has_rt_policy(p))
+           goto out;
+   cfs_rq = cfs_rq_of(se);
+   se->vruntime -= remain;
+   if (se->vruntime < cfs_rq->min_vruntime)
+           se->vruntime = cfs_rq->min_vruntime;
+   requeue_task(rq, p);
+ out:
+   task_rq_unlock(rq, &flags);
+   yield();
+}
+EXPORT_SYMBOL(yield_to);
+#endif
+
 /**
  * sys_sched_yield - yield the current processor to other threads.
  *
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5119b08..2a0a595 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity 
*curr, int queued)
  */
 
 #ifdef CONFIG_SCHED_HRTICK
+u64 slice_remain(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct cfs_rq *cfs_rq;
+   struct rq *rq;
+   u64 slice, ran;
+   s64 delta;
+
+   rq = task_rq_lock(p, &flags);
+   cfs_rq = cfs_rq_of(se);
+   slice = sched_slice(cfs_rq, se);
+   ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+   delta = slice - ran;
+   task_rq_unlock(rq, &flags);
+
+   return max(delta, 0LL);
+}
+
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
struct sched_entity *se = &p->se;


-- 
All rights reversed.


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-02 Thread Chris Wright
* Rik van Riel (r...@redhat.com) wrote:
 Add a yield_to function to the scheduler code, allowing us to
 give the remainder of our timeslice to another thread.
 
 We may want to use this to provide a sys_yield_to system call
 one day.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index c5f926c..4f3cce9 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1985,6 +1985,7 @@ extern void set_user_nice(struct task_struct *p, long 
 nice);
  extern int task_prio(const struct task_struct *p);
  extern int task_nice(const struct task_struct *p);
  extern int can_nice(const struct task_struct *p, const int nice);
 +extern void requeue_task(struct rq *rq, struct task_struct *p);
  extern int task_curr(const struct task_struct *p);
  extern int idle_cpu(int cpu);
  extern int sched_setscheduler(struct task_struct *, int, struct sched_param 
 *);
 @@ -2058,6 +2059,14 @@ extern int wake_up_state(struct task_struct *tsk, 
 unsigned int state);
  extern int wake_up_process(struct task_struct *tsk);
  extern void wake_up_new_task(struct task_struct *tsk,
   unsigned long clone_flags);
 +
 +#ifdef CONFIG_SCHED_HRTICK
 +extern u64 slice_remain(struct task_struct *);
 +extern void yield_to(struct task_struct *);
 +#else
 +static inline void yield_to(struct task_struct *p) yield()

Missing {}'s ?

 +#endif
 +
  #ifdef CONFIG_SMP
   extern void kick_process(struct task_struct *tsk);
  #else
 diff --git a/kernel/sched.c b/kernel/sched.c
 index f8e5a25..ef088cd 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -1909,6 +1909,26 @@ static void dequeue_task(struct rq *rq, struct 
 task_struct *p, int sleep)
  p->se.on_rq = 0;
  }
  
 +/**
  + * requeue_task - requeue a task whose priority got changed by yield_to
  + * @rq: the task's runqueue
 + * @p: the task in question
 + * Must be called with the runqueue lock held. Will cause the CPU to
 + * reschedule if p is now at the head of the runqueue.
 + */
 +void requeue_task(struct rq *rq, struct task_struct *p)
 +{
  + assert_spin_locked(&rq->lock);
  +
  + if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p))
  +         return;

already checked task_running(rq, p) || task_has_rt_policy(p) w/ rq lock
held.

 +
 + dequeue_task(rq, p, 0);
 + enqueue_task(rq, p, 0);

seems like you could condense to save an update_rq_clock() call at least,
don't know if the info_queued, info_dequeued need to be updated

 + resched_task(p);
 +}
 +
  /*
   * __normal_prio - return the priority that is based on the static prio
   */
 @@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, 
 unsigned int, len,
   return ret;
  }
  
 +#ifdef CONFIG_SCHED_HRTICK
 +/*
 + * Yield the CPU, giving the remainder of our time slice to task p.
 + * Typically used to hand CPU time to another thread inside the same
 + * process, eg. when p holds a resource other threads are waiting for.
 + * Giving priority to p may help get that resource released sooner.
 + */
 +void yield_to(struct task_struct *p)
 +{
 + unsigned long flags;
  + struct sched_entity *se = &p->se;
 + struct rq *rq;
 + struct cfs_rq *cfs_rq;
 + u64 remain = slice_remain(current);
 +
  + rq = task_rq_lock(p, &flags);
  + if (task_running(rq, p) || task_has_rt_policy(p))
  +         goto out;
  + cfs_rq = cfs_rq_of(se);
  + se->vruntime -= remain;
  + if (se->vruntime < cfs_rq->min_vruntime)
  +         se->vruntime = cfs_rq->min_vruntime;

Should these details all be in sched_fair?  Seems like the wrong layer
here.  And would that condition go the other way?  If new vruntime is
smaller than min, then it becomes new cfs_rq->min_vruntime?

 + requeue_task(rq, p);
 + out:
  + task_rq_unlock(rq, &flags);
 + yield();
 +}
 +EXPORT_SYMBOL(yield_to);
 +#endif
 +
  /**
   * sys_sched_yield - yield the current processor to other threads.
   *
 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index 5119b08..2a0a595 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity 
 *curr, int queued)
   */
  
  #ifdef CONFIG_SCHED_HRTICK
 +u64 slice_remain(struct task_struct *p)
 +{
 + unsigned long flags;
  + struct sched_entity *se = &p->se;
 + struct cfs_rq *cfs_rq;
 + struct rq *rq;
 + u64 slice, ran;
 + s64 delta;
 +
  + rq = task_rq_lock(p, &flags);
  + cfs_rq = cfs_rq_of(se);
  + slice = sched_slice(cfs_rq, se);
  + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
  + delta = slice - ran;
  + task_rq_unlock(rq, &flags);
 +
 + return max(delta, 0LL);

Can delta go negative?


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-02 Thread Mike Galbraith
On Thu, 2010-12-02 at 14:44 -0500, Rik van Riel wrote:

 +#ifdef CONFIG_SCHED_HRTICK
 +/*
 + * Yield the CPU, giving the remainder of our time slice to task p.
 + * Typically used to hand CPU time to another thread inside the same
 + * process, eg. when p holds a resource other threads are waiting for.
 + * Giving priority to p may help get that resource released sooner.
 + */
 +void yield_to(struct task_struct *p)
 +{
 + unsigned long flags;
  + struct sched_entity *se = &p->se;
 + struct rq *rq;
 + struct cfs_rq *cfs_rq;
 + u64 remain = slice_remain(current);

That slice remaining only shows the distance to last preempt, however
brief.  It shows nothing wrt tree position, the yielding task may well
already be right of the task it wants to yield to, having been a buddy.
 
 cfs_rq = cfs_rq_of(se);
  + se->vruntime -= remain;
  + if (se->vruntime < cfs_rq->min_vruntime)
  +         se->vruntime = cfs_rq->min_vruntime;

(This is usually done using max_vruntime())

If the recipient was already left of the fair stick (min_vruntime),
clipping advances its vruntime, vaporizing entitlement from both donor
and recipient.

What if a task tries to yield to another not on the same cpu, and/or in
the same task group?  This would munge min_vruntime of other queues.  I
think you'd have to restrict this to same cpu, same group.  If tasks can
donate cross cfs_rq, (say) pinned task A cpu A running solo could donate
vruntime to selected tasks pinned to cpu B, for as long as minuscule
preemptions can resupply ammo.  Would suck to not be the favored child.

Maybe you could exchange vruntimes cooperatively (iff same cfs_rq)
between threads, but I don't think donations with clipping works.
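
A minimal sketch of that restriction (illustrative only; cfs_rq_of() is
internal to sched_fair.c):

        /* only allow a cooperative vruntime exchange within a single cfs_rq */
        if (task_cpu(p) != task_cpu(current) ||
            cfs_rq_of(&p->se) != cfs_rq_of(&current->se))
                return;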

-Mike
