Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-26 Thread Steven Rostedt
On Thu, 26 Feb 2015 14:47:54 +0100
Peter Zijlstra  wrote:

> > Thinking about this more, is it because a wmb just forces the CPU to
> > write everything before this before it writes anything after it. That
> > is, the writes themselves can happen at a much later time. Does a plain
> > mb() work the same way if there are no reads required?
> 
> No, neither smp_wmb nor smp_mb are required to flush the store buffers.

Heh, that's what I said :-)  "That is, the writes themselves can happen
at a much later time."

> 
> The only thing barriers do is guarantee order, this can be done by
> flushing store buffers but it can also be done by making sure store
> buffers flush writes in the 'right' order.
> 
> Nor does an rmb help anything with ordering against a possible store
> buffer flush. Again rmb only guarantees two loads are issued in that
> particular order, it doesn't disallow the CPU speculating the load at
> all.

Yep understood.

> > What about using atomic_t?
> > 
> > Note, my latest code doesn't have any of this, but I just want to
> > understand the semantics of these operations a bit better.
> 
> Nope, atomic_t doesn't help here either. Atomics only make sure the RmW
> cycle is atomic.

Crummy. ;-)

> 
> Note that even if wmb or mb did flush the store buffer, you would still
> have a race here.

Oh, it wasn't that I meant to remove the race. I was just trying to
make that race smaller.

But this is all academic now, as my last version doesn't do any of this.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-26 Thread Peter Zijlstra
On Thu, Feb 26, 2015 at 07:43:01AM -0500, Steven Rostedt wrote:
> On Thu, 26 Feb 2015 08:45:59 +0100
> Peter Zijlstra  wrote:
> 
> > On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
> > > It can't be used for state?
> > > 
> > > If one CPU writes "zero", and the other CPU wants to decide if the
> > > system is in the state to do something, isn't a rmb() fine to use?
> > > 
> > > 
> > > CPU 1:
> > > 
> > >   x = 0;
> > >   /* Tell other CPUs they can now do something */
> > >   smp_wmb();
> > > 
> > > CPU 2:
> > >   /* Make sure we see current state of x */
> > >   smp_rmb();
> > >   if (x == 0)
> > >   do_something();
> > > 
> > > The above situation is not acceptable?
> > 
> > Acceptable is just not the word. It plain doesn't work that way.
> 
> Thinking about this more, is it because a wmb just forces the CPU to
> write everything before this before it writes anything after it. That
> is, the writes themselves can happen at a much later time. Does a plain
> mb() work the same way if there are no reads required?

No, neither smp_wmb nor smp_mb are required to flush the store buffers.

The only thing barriers do is guarantee order, this can be done by
flushing store buffers but it can also be done by making sure store
buffers flush writes in the 'right' order.

Nor does an rmb help anything with ordering against a possible store
buffer flush. Again rmb only guarantees two loads are issued in that
particular order, it doesn't disallow the CPU speculating the load at
all.

So that load of X could come out of thin air, or a year ago, or
whatever, definitely before any store buffers were, or were not, flushed
on another cpu.

> > > Otherwise, we fail to be able to do_something() when it is perfectly
> > > fine to do so.
> > 
> > Can't be helped.
> 
> What about using atomic_t?
> 
> Note, my latest code doesn't have any of this, but I just want to
> understand the semantics of these operations a bit better.

Nope, atomic_t doesn't help here either. Atomics only make sure the RmW
cycle is atomic.

Note that even if wmb or mb did flush the store buffer, you would still
have a race here.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-26 Thread Steven Rostedt
On Thu, 26 Feb 2015 08:49:07 +0100
Peter Zijlstra  wrote:
 
> Yes, notice that we don't start iterating at the beginning; this in on
> purpose. If we start iterating at the beginning, _every_ cpu will again
> pile up on the first one.
> 
> By starting at the current cpu, each cpu will start iteration some place
> else and hopefully, with a big enough system, different CPUs end up on a
> different rto cpu.

Note, v3 doesn't start at the current CPU, it starts at the original
CPU again. As the start of every IPI comes from a different CPU, this
still keeps a restart from clobbering the same CPU.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-26 Thread Steven Rostedt
On Thu, 26 Feb 2015 08:45:59 +0100
Peter Zijlstra  wrote:

> On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
> > It can't be used for state?
> > 
> > If one CPU writes "zero", and the other CPU wants to decide if the
> > system is in the state to do something, isn't a rmb() fine to use?
> > 
> > 
> > CPU 1:
> > 
> > x = 0;
> > /* Tell other CPUs they can now do something */
> > smp_wmb();
> > 
> > CPU 2:
> > /* Make sure we see current state of x */
> > smp_rmb();
> > if (x == 0)
> > do_something();
> > 
> > The above situation is not acceptable?
> 
> Acceptable is just not the word. It plain doesn't work that way.

Thinking about this more, is it because a wmb just forces the CPU to
write everything before this before it writes anything after it. That
is, the writes themselves can happen at a much later time. Does a plain
mb() work the same way if there are no reads required?

> 
> > Otherwise, we fail to be able to do_something() when it is perfectly
> > fine to do so.
> 
> Can't be helped.

What about using atomic_t?

Note, my latest code doesn't have any of this, but I just want to
understand the semantics of these operations a bit better.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-26 Thread Steven Rostedt
On Thu, 26 Feb 2015 08:49:07 +0100
Peter Zijlstra pet...@infradead.org wrote:
 
 Yes, notice that we don't start iterating at the beginning; this in on
 purpose. If we start iterating at the beginning, _every_ cpu will again
 pile up on the first one.
 
 By starting at the current cpu, each cpu will start iteration some place
 else and hopefully, with a big enough system, different CPUs end up on a
 different rto cpu.

Note, v3 doesn't start at the current CPU, it starts at the original
CPU again. As the start of every IPI comes from a different CPU, this
still keeps a restart from clobbering the same CPU.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-26 Thread Steven Rostedt
On Thu, 26 Feb 2015 14:47:54 +0100
Peter Zijlstra pet...@infradead.org wrote:

  Thinking about this more, is it because a wmb just forces the CPU to
  write everything before this before it writes anything after it. That
  is, the writes themselves can happen at a much later time. Does a plain
  mb() work the same way if there are no reads required?
 
 No, neither smp_wmb nor smp_mb are required to flush the store buffers.

Heh, that's what I said :-)  That is, the writes themselves can happen
at a much later time.

 
 The only thing barriers do is guarantee order, this can be done by
 flushing store buffers but it can also be done by making sure store
 buffers flush writes in the 'right' order.
 
 Nor does an rmb help anything with ordering against a possible store
 buffer flush. Again rmb only guarantees two loads are issued in that
 particular order, it doesn't disallow the CPU speculating the load at
 all.

Yep understood.

  What about using atomic_t?
  
  Note, my latest code doesn't have any of this, but I just want to
  understand the semantics of these operations a bit better.
 
 Nope, atomic_t doesn't help here either. Atomics only make sure the RmW
 cycle is atomic.

Crummy. ;-)

 
 Note that even if wmb or mb did flush the store buffer, you would still
 have a race here.

Oh, it wasn't that I meant to remove the race. I was just trying to
make that race smaller.

But this is all academic now, as my last version doesn't do any of this.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-26 Thread Steven Rostedt
On Thu, 26 Feb 2015 08:45:59 +0100
Peter Zijlstra pet...@infradead.org wrote:

 On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
  It can't be used for state?
  
  If one CPU writes zero, and the other CPU wants to decide if the
  system is in the state to do something, isn't a rmb() fine to use?
  
  
  CPU 1:
  
  x = 0;
  /* Tell other CPUs they can now do something */
  smp_wmb();
  
  CPU 2:
  /* Make sure we see current state of x */
  smp_rmb();
  if (x == 0)
  do_something();
  
  The above situation is not acceptable?
 
 Acceptable is just not the word. It plain doesn't work that way.

Thinking about this more, is it because a wmb just forces the CPU to
write everything before this before it writes anything after it. That
is, the writes themselves can happen at a much later time. Does a plain
mb() work the same way if there are no reads required?

 
  Otherwise, we fail to be able to do_something() when it is perfectly
  fine to do so.
 
 Can't be helped.

What about using atomic_t?

Note, my latest code doesn't have any of this, but I just want to
understand the semantics of these operations a bit better.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-26 Thread Peter Zijlstra
On Thu, Feb 26, 2015 at 07:43:01AM -0500, Steven Rostedt wrote:
 On Thu, 26 Feb 2015 08:45:59 +0100
 Peter Zijlstra pet...@infradead.org wrote:
 
  On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
   It can't be used for state?
   
   If one CPU writes zero, and the other CPU wants to decide if the
   system is in the state to do something, isn't a rmb() fine to use?
   
   
   CPU 1:
   
 x = 0;
 /* Tell other CPUs they can now do something */
 smp_wmb();
   
   CPU 2:
 /* Make sure we see current state of x */
 smp_rmb();
 if (x == 0)
 do_something();
   
   The above situation is not acceptable?
  
  Acceptable is just not the word. It plain doesn't work that way.
 
 Thinking about this more, is it because a wmb just forces the CPU to
 write everything before this before it writes anything after it. That
 is, the writes themselves can happen at a much later time. Does a plain
 mb() work the same way if there are no reads required?

No, neither smp_wmb nor smp_mb are required to flush the store buffers.

The only thing barriers do is guarantee order, this can be done by
flushing store buffers but it can also be done by making sure store
buffers flush writes in the 'right' order.

Nor does an rmb help anything with ordering against a possible store
buffer flush. Again rmb only guarantees two loads are issued in that
particular order, it doesn't disallow the CPU speculating the load at
all.

So that load of X could come out of thin air, or a year ago, or
whatever, definitely before any store buffers were, or were not, flushed
on another cpu.

   Otherwise, we fail to be able to do_something() when it is perfectly
   fine to do so.
  
  Can't be helped.
 
 What about using atomic_t?
 
 Note, my latest code doesn't have any of this, but I just want to
 understand the semantics of these operations a bit better.

Nope, atomic_t doesn't help here either. Atomics only make sure the RmW
cycle is atomic.

Note that even if wmb or mb did flush the store buffer, you would still
have a race here.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
> > Well, the problem with it is one of collisions. So the 'easy' solution I
> > proposed would be something like:
> > 
> > int ips_next(struct ipi_pull_struct *ips)
> > {
> > int cpu = ips->src_cpu;
> > cpu = cpumask_next(cpu, rto_mask);
> > if (cpu >= nr_cpu_ids) {
> 
> Do we really need to loop? Just start with the first one, and go to the
> end.
> 
> > cpu = 0;
> > ips->flags |= IPS_LOOPED;
> > cpu = cpumask_next(cpu, rto_mask);
> > if (cpu >= nr_cpu_ids) /* empty mask *;
> > return cpu;
> > }
> > if (ips->flags & IPS_LOOPED && cpu >= ips->stop_cpu)
> > return nr_cpu_ids;
> > return cpu;
> > }

Yes, notice that we don't start iterating at the beginning; this in on
purpose. If we start iterating at the beginning, _every_ cpu will again
pile up on the first one.

By starting at the current cpu, each cpu will start iteration some place
else and hopefully, with a big enough system, different CPUs end up on a
different rto cpu.

> > 
> > 
> > struct ipi_pull_struct *ips = __this_cpu_ptr(ips);
> > 
> > raw_spin_lock(>lock);
> > if (ips->flags & IPS_BUSY) {
> > /* there is an IPI active; update state */
> > ips->dst_prio = current->prio;
> > ips->stop_cpu = ips->src_cpu;
> > ips->flags &= ~IPS_LOOPED;
> 
> I guess the loop is needed for continuing the work, in case the
> scheduling changed?

That too.

> > } else {
> > /* no IPI active, make one go */
> > ips->dst_cpu = smp_processor_id();
> > ips->dst_prio = current->prio;
> > ips->src_cpu = ips->dst_cpu;
> > ips->stop_cpu = ips->dst_cpu;
> > ips->flags = IPS_BUSY;
> > 
> > cpu = ips_next(ips);
> > ips->src_cpu = cpu;
> > if (cpu < nr_cpu_ids)
> > irq_work_queue_on(>work, cpu);
> > }
> > raw_spin_unlock(>lock);
> 
> I'll have to spend some time comprehending this.

:-)

> > Where you would simply start walking the RTO mask from the current
> > position -- it also includes some restart logic, and you'd only take
> > ips->lock when your ipi handler starts and when it needs to migrate to
> > another cpu.
> > 
> > This way, on big systems, there's at least some chance different CPUs
> > find different targets to pull from.
> 
> OK, makes sense. I can try that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
> It can't be used for state?
> 
> If one CPU writes "zero", and the other CPU wants to decide if the
> system is in the state to do something, isn't a rmb() fine to use?
> 
> 
> CPU 1:
> 
>   x = 0;
>   /* Tell other CPUs they can now do something */
>   smp_wmb();
> 
> CPU 2:
>   /* Make sure we see current state of x */
>   smp_rmb();
>   if (x == 0)
>   do_something();
> 
> The above situation is not acceptable?

Acceptable is just not the word. It plain doesn't work that way.

> Otherwise, we fail to be able to do_something() when it is perfectly
> fine to do so.

Can't be helped.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Steven Rostedt
On Wed, 25 Feb 2015 18:11:10 +0100
Peter Zijlstra  wrote:

>
> > Actually, this is a speed up, but if exiting a irq handler is also a
> > full barrier, then it is not needed. 
> 
> Barrieres are _NEVER_ about speedups, they're either required or not.

"speed up" was a wrong word. "correct state" is what I should have said.

I meant "speed up" the getting of the "correct state" :-)

> 
> > Here's what I have:
> > 
> > rmb();
> > if (!pending)
> > send ipi;
> > 
> > 
> > The above is to make sure we see the pending bit before making the
> > decision. It would suck if it was already cleared, but because we never
> > flushed the cache, that it always sees it as pending.
> 
> barriers are not about flushing store buffers (although some
> implementations might do so we cannot rely on that in general).
> 
> The above rmb is entirely without function; as stated you need at least
> two loads for an rmb to make sense.

It can't be used for state?

If one CPU writes "zero", and the other CPU wants to decide if the
system is in the state to do something, isn't a rmb() fine to use?


CPU 1:

x = 0;
/* Tell other CPUs they can now do something */
smp_wmb();

CPU 2:
/* Make sure we see current state of x */
smp_rmb();
if (x == 0)
do_something();

The above situation is not acceptable?

Otherwise, we fail to be able to do_something() when it is perfectly
fine to do so.


> 
> > > > @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
> > > >  */
> > > > smp_rmb();
> > > >  
> > > > +   /* Use local just in case a feature is switched in the middle 
> > > > of this */
> > > > +   if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
> > > > +   /* Make sure the update to pending is visible here. */
> > > > +   smp_rmb();
> > > 
> > > Another dubious barriers; the sched_feat() 'LOAD' cannot matter,
> > > therefore this barries doesn't add any order over the previous rmb.
> > 
> > I added it before noticing that there was an rmb() above :-)
> > 
> > > 
> > > And again, you need two loads for an RMB to make sense, and a WMB (or
> > > MB) on the other side with some STORE.
> > > 
> > > I think this refers to the store of push_csd_pending at the tail of
> > > try_to_push_task(), but as there's no order there, there cannot be any
> > > here either.
> > 
> > Right, but it's more of making sure that the state of the system is
> > correct at that point than a correctness thing.
> 
> -ENOPARSE, there appear to be two different definitions of correct going
> about in the above sentence.
> 
> Outside of arch code (and rarely there) should we use barriers for
> anything other than ordering guarantees. And there is no 'order' in a
> single variable.

Can they not be used for passing of state as mentioned above?

> 
> > > > +
> > > > +   /* If a push ipi is out, then we must do the old method 
> > > > */
> > > > +   push_pending = READ_ONCE(this_rq->rt.push_csd_pending);
> > > > +   }
> > > 
> > > Hmm, this deserves a wee bit more comment I think.
> > > 
> > > Ideally you'd be able to 'cancel' the active IPI and re-issue it.
> > 
> > Right, but 'canceling' is really racy. The thing is, there's no locks
> > out there to serialize this. And due to the nature of this code, we
> > don't want any locks either.
> > 
> > Since I was using smp_call_function I also could not start one until
> > the previous one was finished, as it uses the same data structures to
> > do the send.
> > 
> > Even if I switch this to irq_work, I still have no good way to
> > cancel it without introducing a race. How do I cancel it and know that
> > it is canceled before starting a new ipi? I'm not sure I can use the
> > same irq_work struct for two calls of queue_work_on().
> > 
> > I tried several ways but always found a new mole to whack. I decided to
> > punt and just fall back to the old "safe" way if this case happens. It
> > seldom does even on stress tests.
> > 
> > If you can think of a safe way to do this, please enlighten me :-)
> 
> You could put a lock around your IPI state; that is:
> 
> #define IPS_BUSY  0x01
> #define IPS_LOOPED0x02
> 
> struct ipi_pull_struct {
>   struct irq_work work;
>   raw_spinlock_t  lock;
>   int flags;
>   int dst_cpu;/* CPU that issued this search */
>   int dst_prio;   /* Prio of the originating CPU */
>   int src_cpu;/* CPU where the IPI is running */
>   int stop_cpu;   /* cpu where we can stop iterating */
>   ...
> };
> 
> Where you can increase seq every time you change it and the IPI will
> continue at least one more rto mask round for every change?
> 
> > > Hohumm,.. so there's still a problem here, and you created it by looking
> > > for the highest prio overloaded cpu.
> > > 
> > > This means that if multiple CPUs need to go pull, they'll end up sending 
> > > IPIs
> > > to the same highest overloaded CPU.
> > 

Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 06:11:10PM +0100, Peter Zijlstra wrote:
> 
> #define IPS_BUSY  0x01
> #define IPS_LOOPED0x02
> 
> struct ipi_pull_struct {
>   struct irq_work work;
>   raw_spinlock_t  lock;
>   int flags;
>   int dst_cpu;/* CPU that issued this search */
>   int dst_prio;   /* Prio of the originating CPU */
>   int src_cpu;/* CPU where the IPI is running */
>   int stop_cpu;   /* cpu where we can stop iterating */
>   ...
> };
> 
> Where you can increase seq every time you change it and the IPI will
> continue at least one more rto mask round for every change?

seq got lost.. ignore that bit, look at the code.

> int ips_next(struct ipi_pull_struct *ips)
> {
>   int cpu = ips->src_cpu;
>   cpu = cpumask_next(cpu, rto_mask);
>   if (cpu >= nr_cpu_ids) {
>   cpu = 0;

should be -1

>   ips->flags |= IPS_LOOPED;
>   cpu = cpumask_next(cpu, rto_mask);
>   if (cpu >= nr_cpu_ids) /* empty mask *;
>   return cpu;
>   }
>   if (ips->flags & IPS_LOOPED && cpu >= ips->stop_cpu)
>   return nr_cpu_ids;
>   return cpu;
> }
> 
> 
> 
>   struct ipi_pull_struct *ips = __this_cpu_ptr(ips);
> 
>   raw_spin_lock(>lock);
>   if (ips->flags & IPS_BUSY) {
>   /* there is an IPI active; update state */
>   ips->dst_prio = current->prio;
>   ips->stop_cpu = ips->src_cpu;
>   ips->flags &= ~IPS_LOOPED;
>   } else {
>   /* no IPI active, make one go */
>   ips->dst_cpu = smp_processor_id();
>   ips->dst_prio = current->prio;
>   ips->src_cpu = ips->dst_cpu;
>   ips->stop_cpu = ips->dst_cpu;
>   ips->flags = IPS_BUSY;
> 
>   cpu = ips_next(ips);
>   ips->src_cpu = cpu;
>   if (cpu < nr_cpu_ids)
>   irq_work_queue_on(>work, cpu);
>   }
>   raw_spin_unlock(>lock);

There might be a wee race vs the busy looping, seeing how src_cpu might
be checking against the old dst_prio, so we could check one too few rto
mask.

No real problem I think.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 10:51:16AM -0500, Steven Rostedt wrote:
> > > +static void try_to_push_tasks(void *arg)
> > > +{
> > > + struct rt_rq *rt_rq = arg;
> > > + struct rq *rq, *next_rq;
> > > + int next_cpu = -1;
> > > + int next_prio = MAX_PRIO + 1;
> > > + int this_prio;
> > > + int src_prio;
> > > + int prio;
> > > + int this_cpu;
> > > + int success;
> > > + int cpu;
> > > +
> > > + /* Make sure we can see csd_cpu */
> > > + smp_rmb();
> > 
> > As per the above, interrupt are serializing, this is not needed.
> 
> Because entering an interrupt is a serializing point with other CPUs?

https://lkml.org/lkml/2009/2/18/145

So raising IPI vs receiving one should be serialized. Also, both APIs
(irq_work_queue_on() and smp_call_function_single()) should guarantee
this even if we didn't require the architecture to provide this.

> > > + /*
> > > +  * We use the priority that determined to send to this CPU
> > > +  * even if the priority for this CPU changed. This is used
> > > +  * to determine what other CPUs to send to, to keep from
> > > +  * doing a ping pong from each CPU.
> > > +  */
> > > + this_prio = rt_rq->push_csd_prio;
> > > + src_prio = rt_rq->highest_prio.curr;
> > 
> > Should we, at this point, not check if the condition that triggered the
> > pull request is still valid on our src cpu? No point in continuing our
> > IPI chain if the CPU we're looking for work for is no longer interested
> > in it.
> 
> But how do we know that?

Dunno, I didn't say I really thought about it ;-)

> I added logic here to do exactly that, but then realized I need to keep
> track of too much information. The pull happens when the rq schedules a
> task of lower priority. That new task can still be an RT task. To know
> we do not need to do the pull, we need to keep track of what the
> original priority was.
> 
> Hmm, but that said, we can add an optimization here and do this:
> 
>   if (src_prio <= this_prio)
>   goto done;
> 
> As "this_prio" is what we saw that we could pull. Thus, if the rq that
> is asking to do the pull schedules a task that is equal or higher in
> priority than the highest rq it sent a pull request for, then we do not
> need to continue this IPI.
> 
> I'll add that.

Yeah something like that might work. You could also put in a stop flag,
which the IPI handler would check before propagating.


> > > + /* Make sure the next cpu is seen on remote CPU */
> > > + smp_mb();
> > 
> > Not required,
> 
> Because irq_work_queue_on() is a full barrier, right?

Lets say yes :-)

> > > +done:
> > > + rt_rq->push_csd_pending = 0;
> > > +
> > > + /* Now make sure the src CPU can see this update */
> > > + smp_wmb();
> > 
> > 
> > This barrier does not make sense either, for a wmb to make sense you
> > need two stores:
> > 
> > x := 0, y := 0
> > 
> > [S] x = 1
> > WMB
> > [S] y = 1
> > 
> > And one would typically pair that with some loads like:
> > 
> > [L] r1 = y
> > RMB
> > [L] r2 = x
> > 
> > Which gives you the guarantee that if r1 is true, l2 must then also be
> > true.
> > 
> > How in this case, there is no subsequent store.. therefore there is no
> > order and therefore there is no use of barriers.
> 
> Actually, this is a speed up, but if exiting a irq handler is also a
> full barrier, then it is not needed. 

Barrieres are _NEVER_ about speedups, they're either required or not.

> Here's what I have:
> 
>   rmb();
>   if (!pending)
>   send ipi;
> 
> 
> The above is to make sure we see the pending bit before making the
> decision. It would suck if it was already cleared, but because we never
> flushed the cache, that it always sees it as pending.

barriers are not about flushing store buffers (although some
implementations might do so we cannot rely on that in general).

The above rmb is entirely without function; as stated you need at least
two loads for an rmb to make sense.

> > > @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
> > >*/
> > >   smp_rmb();
> > >  
> > > + /* Use local just in case a feature is switched in the middle of this */
> > > + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
> > > + /* Make sure the update to pending is visible here. */
> > > + smp_rmb();
> > 
> > Another dubious barriers; the sched_feat() 'LOAD' cannot matter,
> > therefore this barries doesn't add any order over the previous rmb.
> 
> I added it before noticing that there was an rmb() above :-)
> 
> > 
> > And again, you need two loads for an RMB to make sense, and a WMB (or
> > MB) on the other side with some STORE.
> > 
> > I think this refers to the store of push_csd_pending at the tail of
> > try_to_push_task(), but as there's no order there, there cannot be any
> > here either.
> 
> Right, but it's more of making sure that the state of the system is
> correct at that point than a correctness thing.

-ENOPARSE, there appear to be two different definitions of correct going
about in the above sentence.

Outside 

Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Steven Rostedt
On Wed, 25 Feb 2015 11:35:35 +0100
Peter Zijlstra  wrote:

> On Tue, Feb 24, 2015 at 01:39:46PM -0500, Steven Rostedt wrote:
> > Index: linux-rt.git/kernel/sched/rt.c
> > ===
> > --- linux-rt.git.orig/kernel/sched/rt.c 2015-02-24 10:44:08.798785452 
> > -0500
> > +++ linux-rt.git/kernel/sched/rt.c  2015-02-24 13:07:20.154735448 -0500
> 
> > @@ -1760,11 +1771,171 @@ static void push_rt_tasks(struct rq *rq)
> > ;
> >  }
> >  
> > +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
> > +{
> > +   /* We should never be here if the IPI is already out. */
> > +   BUG_ON(rq->rt.push_csd_pending);
> > +
> > +   rq->rt.push_csd_pending = 1;
> > +   rq->rt.push_csd_cpu = cpu;
> > +   /* Save the prio that we used to find this CPU */
> > +   rq->rt.push_csd_prio = prio;
> > +
> > +   /* Make sure csd_cpu is visible before we send the ipi */
> > +   smp_mb();
> 
> We've architecturally defined the raising of interrupts vs the execution
> of the handler to be a serializing event, therefore this full barrier is
> not required.
> 
> I think we documented it in places, but I never can find it.
> 
> > +
> > +   smp_call_function_single_async(cpu, >rt.push_csd);
> 
> I'm confused why are you mixing smp_call_function_single_async() and
> irq_work_queue_on() in the same logic?

I originally started with smp_call_function_single_async(), but found
that it takes csd locks and holds them while the functions are
executed. Meaning, you can not call another
smp_call_function_single_async() from within a previous call. I left
the code that starts the IPI as the smp_call_function but changed the
internal calls to use irq_queue_work_on(). I guess I can switch all to
just use that if you want only one, as the
smp_call_function_single_async() is not even an option.

> 
> Pick one and stick to it.
> 
> Also: 
> lkml.kernel.org/r/ca+55afz492bzlfhdbkn-hygjcreup7cjmeyk3ntsfrwjppz...@mail.gmail.com

Ah that would have helped. I talked about doing this too in the past.
I think I mentioned it to you on IRC.

But I believe this still requires the user making sure the original
call is synchronized, which breaks making a change for you below. I'll
discuss that further down.

> 
> Now I would suggest you use irq_wok_queue_on() for this, consistently,
> because irq_works are ran after the smp function calls complete and
> therefore minimize the latency for people waiting on the (sync) smp
> function calls.

Well, we still want the pull to happen as fast as possible. But sure,
I get your point.

> 
> > +}
> > +
> > +static void try_to_push_tasks(void *arg)
> > +{
> > +   struct rt_rq *rt_rq = arg;
> > +   struct rq *rq, *next_rq;
> > +   int next_cpu = -1;
> > +   int next_prio = MAX_PRIO + 1;
> > +   int this_prio;
> > +   int src_prio;
> > +   int prio;
> > +   int this_cpu;
> > +   int success;
> > +   int cpu;
> > +
> > +   /* Make sure we can see csd_cpu */
> > +   smp_rmb();
> 
> As per the above, interrupt are serializing, this is not needed.

Because entering an interrupt is a serializing point with other CPUs?

> 
> > +
> > +   this_cpu = rt_rq->push_csd_cpu;
> > +
> > +   /* Paranoid check */
> > +   BUG_ON(this_cpu != smp_processor_id());
> > +
> > +   rq = cpu_rq(this_cpu);
> > +
> > +   /*
> > +* If there's nothing to push here, then see if another queue
> > +* can push instead.
> > +*/
> > +   if (!has_pushable_tasks(rq))
> > +   goto pass_the_ipi;
> > +
> > +   raw_spin_lock(>lock);
> > +   success = push_rt_task(rq);
> > +   raw_spin_unlock(>lock);
> > +
> > +   if (success)
> > +   goto done;
> > +
> > +   /* Nothing was pushed, try another queue */
> > +pass_the_ipi:
> > +
> > +   /*
> > +* We use the priority that determined to send to this CPU
> > +* even if the priority for this CPU changed. This is used
> > +* to determine what other CPUs to send to, to keep from
> > +* doing a ping pong from each CPU.
> > +*/
> > +   this_prio = rt_rq->push_csd_prio;
> > +   src_prio = rt_rq->highest_prio.curr;
> 
> Should we, at this point, not check if the condition that triggered the
> pull request is still valid on our src cpu? No point in continuing our
> IPI chain if the CPU we're looking for work for is no longer interested
> in it.

But how do we know that?

I added logic here to do exactly that, but then realized I need to keep
track of too much information. The pull happens when the rq schedules a
task of lower priority. That new task can still be an RT task. To know
we do not need to do the pull, we need to keep track of what the
original priority was.

Hmm, but that said, we can add an optimization here and do this:

if (src_prio <= this_prio)
goto done;

As "this_prio" is what we saw that we could pull. Thus, if the rq that
is asking to do the pull schedules a task that is equal or higher in
priority than the highest rq it sent a pull request for, then we do not
need 

Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Tue, Feb 24, 2015 at 01:39:46PM -0500, Steven Rostedt wrote:
> Index: linux-rt.git/kernel/sched/rt.c
> ===
> --- linux-rt.git.orig/kernel/sched/rt.c   2015-02-24 10:44:08.798785452 
> -0500
> +++ linux-rt.git/kernel/sched/rt.c2015-02-24 13:07:20.154735448 -0500

> @@ -1760,11 +1771,171 @@ static void push_rt_tasks(struct rq *rq)
>   ;
>  }
>  
> +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
> +{
> + /* We should never be here if the IPI is already out. */
> + BUG_ON(rq->rt.push_csd_pending);
> +
> + rq->rt.push_csd_pending = 1;
> + rq->rt.push_csd_cpu = cpu;
> + /* Save the prio that we used to find this CPU */
> + rq->rt.push_csd_prio = prio;
> +
> + /* Make sure csd_cpu is visible before we send the ipi */
> + smp_mb();

We've architecturally defined the raising of interrupts vs the execution
of the handler to be a serializing event, therefore this full barrier is
not required.

I think we documented it in places, but I never can find it.

> +
> + smp_call_function_single_async(cpu, >rt.push_csd);

I'm confused why are you mixing smp_call_function_single_async() and
irq_work_queue_on() in the same logic?

Pick one and stick to it.

Also: 
lkml.kernel.org/r/ca+55afz492bzlfhdbkn-hygjcreup7cjmeyk3ntsfrwjppz...@mail.gmail.com

Now I would suggest you use irq_wok_queue_on() for this, consistently,
because irq_works are ran after the smp function calls complete and
therefore minimize the latency for people waiting on the (sync) smp
function calls.

> +}
> +
> +static void try_to_push_tasks(void *arg)
> +{
> + struct rt_rq *rt_rq = arg;
> + struct rq *rq, *next_rq;
> + int next_cpu = -1;
> + int next_prio = MAX_PRIO + 1;
> + int this_prio;
> + int src_prio;
> + int prio;
> + int this_cpu;
> + int success;
> + int cpu;
> +
> + /* Make sure we can see csd_cpu */
> + smp_rmb();

As per the above, interrupt are serializing, this is not needed.

> +
> + this_cpu = rt_rq->push_csd_cpu;
> +
> + /* Paranoid check */
> + BUG_ON(this_cpu != smp_processor_id());
> +
> + rq = cpu_rq(this_cpu);
> +
> + /*
> +  * If there's nothing to push here, then see if another queue
> +  * can push instead.
> +  */
> + if (!has_pushable_tasks(rq))
> + goto pass_the_ipi;
> +
> + raw_spin_lock(>lock);
> + success = push_rt_task(rq);
> + raw_spin_unlock(>lock);
> +
> + if (success)
> + goto done;
> +
> + /* Nothing was pushed, try another queue */
> +pass_the_ipi:
> +
> + /*
> +  * We use the priority that determined to send to this CPU
> +  * even if the priority for this CPU changed. This is used
> +  * to determine what other CPUs to send to, to keep from
> +  * doing a ping pong from each CPU.
> +  */
> + this_prio = rt_rq->push_csd_prio;
> + src_prio = rt_rq->highest_prio.curr;

Should we, at this point, not check if the condition that triggered the
pull request is still valid on our src cpu? No point in continuing our
IPI chain if the CPU we're looking for work for is no longer interested
in it.

> + for_each_cpu(cpu, rq->rd->rto_mask) {
> + if (this_cpu == cpu)
> + continue;
> +
> + /*
> +  * This function was called because some rq lowered its
> +  * priority. It then searched for the highest priority
> +  * rq that had overloaded tasks and sent an smp function
> +  * call to that cpu to call this function to push its
> +  * tasks. But when it got here, the task was either
> +  * already pushed, or due to affinity, could not move
> +  * the overloaded task.
> +  *
> +  * Now we need to see if there's another overloaded rq that
> +  * has an RT task that can migrate to that CPU.
> +  *
> +  * We need to be careful, we do not want to cause a ping
> +  * pong between this CPU and another CPU that has an RT task
> +  * that can migrate, but not to the CPU that lowered its
> +  * priority. Since the lowering priority CPU finds the highest
> +  * priority rq to send to, we will ignore any rq that is of 
> higher
> +  * priority than this current one. 

Maybe start a new paragraph here?

   That is, if a rq scheduled a
> +  * task of higher priority, the schedule itself would do the
> +  * push or pull then. We can safely ignore higher priority rqs.
> +  * And if there's one that is the same priority, since the CPUS
> +  * are searched in order we will ignore CPUS of the same 
> priority
> +  * unless the CPU number is greater than this CPU's number.
> +  */

I would s/CPUS/CPUs/ the plural is not part of the 

Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Steven Rostedt
On Wed, 25 Feb 2015 11:35:35 +0100
Peter Zijlstra pet...@infradead.org wrote:

 On Tue, Feb 24, 2015 at 01:39:46PM -0500, Steven Rostedt wrote:
  Index: linux-rt.git/kernel/sched/rt.c
  ===
  --- linux-rt.git.orig/kernel/sched/rt.c 2015-02-24 10:44:08.798785452 
  -0500
  +++ linux-rt.git/kernel/sched/rt.c  2015-02-24 13:07:20.154735448 -0500
 
  @@ -1760,11 +1771,171 @@ static void push_rt_tasks(struct rq *rq)
  ;
   }
   
  +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
  +{
  +   /* We should never be here if the IPI is already out. */
  +   BUG_ON(rq-rt.push_csd_pending);
  +
  +   rq-rt.push_csd_pending = 1;
  +   rq-rt.push_csd_cpu = cpu;
  +   /* Save the prio that we used to find this CPU */
  +   rq-rt.push_csd_prio = prio;
  +
  +   /* Make sure csd_cpu is visible before we send the ipi */
  +   smp_mb();
 
 We've architecturally defined the raising of interrupts vs the execution
 of the handler to be a serializing event, therefore this full barrier is
 not required.
 
 I think we documented it in places, but I never can find it.
 
  +
  +   smp_call_function_single_async(cpu, rq-rt.push_csd);
 
 I'm confused why are you mixing smp_call_function_single_async() and
 irq_work_queue_on() in the same logic?

I originally started with smp_call_function_single_async(), but found
that it takes csd locks and holds them while the functions are
executed. Meaning, you can not call another
smp_call_function_single_async() from within a previous call. I left
the code that starts the IPI as the smp_call_function but changed the
internal calls to use irq_queue_work_on(). I guess I can switch all to
just use that if you want only one, as the
smp_call_function_single_async() is not even an option.

 
 Pick one and stick to it.
 
 Also: 
 lkml.kernel.org/r/ca+55afz492bzlfhdbkn-hygjcreup7cjmeyk3ntsfrwjppz...@mail.gmail.com

Ah that would have helped. I talked about doing this too in the past.
I think I mentioned it to you on IRC.

But I believe this still requires the user making sure the original
call is synchronized, which breaks making a change for you below. I'll
discuss that further down.

 
 Now I would suggest you use irq_wok_queue_on() for this, consistently,
 because irq_works are ran after the smp function calls complete and
 therefore minimize the latency for people waiting on the (sync) smp
 function calls.

Well, we still want the pull to happen as fast as possible. But sure,
I get your point.

 
  +}
  +
  +static void try_to_push_tasks(void *arg)
  +{
  +   struct rt_rq *rt_rq = arg;
  +   struct rq *rq, *next_rq;
  +   int next_cpu = -1;
  +   int next_prio = MAX_PRIO + 1;
  +   int this_prio;
  +   int src_prio;
  +   int prio;
  +   int this_cpu;
  +   int success;
  +   int cpu;
  +
  +   /* Make sure we can see csd_cpu */
  +   smp_rmb();
 
 As per the above, interrupt are serializing, this is not needed.

Because entering an interrupt is a serializing point with other CPUs?

 
  +
  +   this_cpu = rt_rq-push_csd_cpu;
  +
  +   /* Paranoid check */
  +   BUG_ON(this_cpu != smp_processor_id());
  +
  +   rq = cpu_rq(this_cpu);
  +
  +   /*
  +* If there's nothing to push here, then see if another queue
  +* can push instead.
  +*/
  +   if (!has_pushable_tasks(rq))
  +   goto pass_the_ipi;
  +
  +   raw_spin_lock(rq-lock);
  +   success = push_rt_task(rq);
  +   raw_spin_unlock(rq-lock);
  +
  +   if (success)
  +   goto done;
  +
  +   /* Nothing was pushed, try another queue */
  +pass_the_ipi:
  +
  +   /*
  +* We use the priority that determined to send to this CPU
  +* even if the priority for this CPU changed. This is used
  +* to determine what other CPUs to send to, to keep from
  +* doing a ping pong from each CPU.
  +*/
  +   this_prio = rt_rq-push_csd_prio;
  +   src_prio = rt_rq-highest_prio.curr;
 
 Should we, at this point, not check if the condition that triggered the
 pull request is still valid on our src cpu? No point in continuing our
 IPI chain if the CPU we're looking for work for is no longer interested
 in it.

But how do we know that?

I added logic here to do exactly that, but then realized I need to keep
track of too much information. The pull happens when the rq schedules a
task of lower priority. That new task can still be an RT task. To know
we do not need to do the pull, we need to keep track of what the
original priority was.

Hmm, but that said, we can add an optimization here and do this:

if (src_prio = this_prio)
goto done;

As this_prio is what we saw that we could pull. Thus, if the rq that
is asking to do the pull schedules a task that is equal or higher in
priority than the highest rq it sent a pull request for, then we do not
need to continue this IPI.

I'll add that.


 
  +   for_each_cpu(cpu, rq-rd-rto_mask) {
  +   if (this_cpu == cpu)
  +   continue;
  +
  + 

Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Steven Rostedt
On Wed, 25 Feb 2015 18:11:10 +0100
Peter Zijlstra pet...@infradead.org wrote:


  Actually, this is a speed up, but if exiting a irq handler is also a
  full barrier, then it is not needed. 
 
 Barrieres are _NEVER_ about speedups, they're either required or not.

speed up was a wrong word. correct state is what I should have said.

I meant speed up the getting of the correct state :-)

 
  Here's what I have:
  
  rmb();
  if (!pending)
  send ipi;
  
  
  The above is to make sure we see the pending bit before making the
  decision. It would suck if it was already cleared, but because we never
  flushed the cache, that it always sees it as pending.
 
 barriers are not about flushing store buffers (although some
 implementations might do so we cannot rely on that in general).
 
 The above rmb is entirely without function; as stated you need at least
 two loads for an rmb to make sense.

It can't be used for state?

If one CPU writes zero, and the other CPU wants to decide if the
system is in the state to do something, isn't a rmb() fine to use?


CPU 1:

x = 0;
/* Tell other CPUs they can now do something */
smp_wmb();

CPU 2:
/* Make sure we see current state of x */
smp_rmb();
if (x == 0)
do_something();

The above situation is not acceptable?

Otherwise, we fail to be able to do_something() when it is perfectly
fine to do so.


 
@@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
 */
smp_rmb();
 
+   /* Use local just in case a feature is switched in the middle 
of this */
+   if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
+   /* Make sure the update to pending is visible here. */
+   smp_rmb();
   
   Another dubious barriers; the sched_feat() 'LOAD' cannot matter,
   therefore this barries doesn't add any order over the previous rmb.
  
  I added it before noticing that there was an rmb() above :-)
  
   
   And again, you need two loads for an RMB to make sense, and a WMB (or
   MB) on the other side with some STORE.
   
   I think this refers to the store of push_csd_pending at the tail of
   try_to_push_task(), but as there's no order there, there cannot be any
   here either.
  
  Right, but it's more of making sure that the state of the system is
  correct at that point than a correctness thing.
 
 -ENOPARSE, there appear to be two different definitions of correct going
 about in the above sentence.
 
 Outside of arch code (and rarely there) should we use barriers for
 anything other than ordering guarantees. And there is no 'order' in a
 single variable.

Can they not be used for passing of state as mentioned above?

 
+
+   /* If a push ipi is out, then we must do the old method 
*/
+   push_pending = READ_ONCE(this_rq-rt.push_csd_pending);
+   }
   
   Hmm, this deserves a wee bit more comment I think.
   
   Ideally you'd be able to 'cancel' the active IPI and re-issue it.
  
  Right, but 'canceling' is really racy. The thing is, there's no locks
  out there to serialize this. And due to the nature of this code, we
  don't want any locks either.
  
  Since I was using smp_call_function I also could not start one until
  the previous one was finished, as it uses the same data structures to
  do the send.
  
  Even if I switch this to irq_work, I still have no good way to
  cancel it without introducing a race. How do I cancel it and know that
  it is canceled before starting a new ipi? I'm not sure I can use the
  same irq_work struct for two calls of queue_work_on().
  
  I tried several ways but always found a new mole to whack. I decided to
  punt and just fall back to the old safe way if this case happens. It
  seldom does even on stress tests.
  
  If you can think of a safe way to do this, please enlighten me :-)
 
 You could put a lock around your IPI state; that is:
 
 #define IPS_BUSY  0x01
 #define IPS_LOOPED0x02
 
 struct ipi_pull_struct {
   struct irq_work work;
   raw_spinlock_t  lock;
   int flags;
   int dst_cpu;/* CPU that issued this search */
   int dst_prio;   /* Prio of the originating CPU */
   int src_cpu;/* CPU where the IPI is running */
   int stop_cpu;   /* cpu where we can stop iterating */
   ...
 };
 
 Where you can increase seq every time you change it and the IPI will
 continue at least one more rto mask round for every change?
 
   Hohumm,.. so there's still a problem here, and you created it by looking
   for the highest prio overloaded cpu.
   
   This means that if multiple CPUs need to go pull, they'll end up sending 
   IPIs
   to the same highest overloaded CPU.
  
  Sure, and if you like, I'll post timings of the cost of this.
  
  I can add tests to try to trigger the worse case scenario, and we can
  see what this actually is. So far, it is much better than the current
  everyone 

Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 10:51:16AM -0500, Steven Rostedt wrote:
   +static void try_to_push_tasks(void *arg)
   +{
   + struct rt_rq *rt_rq = arg;
   + struct rq *rq, *next_rq;
   + int next_cpu = -1;
   + int next_prio = MAX_PRIO + 1;
   + int this_prio;
   + int src_prio;
   + int prio;
   + int this_cpu;
   + int success;
   + int cpu;
   +
   + /* Make sure we can see csd_cpu */
   + smp_rmb();
  
  As per the above, interrupt are serializing, this is not needed.
 
 Because entering an interrupt is a serializing point with other CPUs?

https://lkml.org/lkml/2009/2/18/145

So raising IPI vs receiving one should be serialized. Also, both APIs
(irq_work_queue_on() and smp_call_function_single()) should guarantee
this even if we didn't require the architecture to provide this.

   + /*
   +  * We use the priority that determined to send to this CPU
   +  * even if the priority for this CPU changed. This is used
   +  * to determine what other CPUs to send to, to keep from
   +  * doing a ping pong from each CPU.
   +  */
   + this_prio = rt_rq-push_csd_prio;
   + src_prio = rt_rq-highest_prio.curr;
  
  Should we, at this point, not check if the condition that triggered the
  pull request is still valid on our src cpu? No point in continuing our
  IPI chain if the CPU we're looking for work for is no longer interested
  in it.
 
 But how do we know that?

Dunno, I didn't say I really thought about it ;-)

 I added logic here to do exactly that, but then realized I need to keep
 track of too much information. The pull happens when the rq schedules a
 task of lower priority. That new task can still be an RT task. To know
 we do not need to do the pull, we need to keep track of what the
 original priority was.
 
 Hmm, but that said, we can add an optimization here and do this:
 
   if (src_prio = this_prio)
   goto done;
 
 As this_prio is what we saw that we could pull. Thus, if the rq that
 is asking to do the pull schedules a task that is equal or higher in
 priority than the highest rq it sent a pull request for, then we do not
 need to continue this IPI.
 
 I'll add that.

Yeah something like that might work. You could also put in a stop flag,
which the IPI handler would check before propagating.


   + /* Make sure the next cpu is seen on remote CPU */
   + smp_mb();
  
  Not required,
 
 Because irq_work_queue_on() is a full barrier, right?

Lets say yes :-)

   +done:
   + rt_rq-push_csd_pending = 0;
   +
   + /* Now make sure the src CPU can see this update */
   + smp_wmb();
  
  
  This barrier does not make sense either, for a wmb to make sense you
  need two stores:
  
  x := 0, y := 0
  
  [S] x = 1
  WMB
  [S] y = 1
  
  And one would typically pair that with some loads like:
  
  [L] r1 = y
  RMB
  [L] r2 = x
  
  Which gives you the guarantee that if r1 is true, l2 must then also be
  true.
  
  How in this case, there is no subsequent store.. therefore there is no
  order and therefore there is no use of barriers.
 
 Actually, this is a speed up, but if exiting a irq handler is also a
 full barrier, then it is not needed. 

Barrieres are _NEVER_ about speedups, they're either required or not.

 Here's what I have:
 
   rmb();
   if (!pending)
   send ipi;
 
 
 The above is to make sure we see the pending bit before making the
 decision. It would suck if it was already cleared, but because we never
 flushed the cache, that it always sees it as pending.

barriers are not about flushing store buffers (although some
implementations might do so we cannot rely on that in general).

The above rmb is entirely without function; as stated you need at least
two loads for an rmb to make sense.

   @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
  */
 smp_rmb();

   + /* Use local just in case a feature is switched in the middle of this */
   + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
   + /* Make sure the update to pending is visible here. */
   + smp_rmb();
  
  Another dubious barriers; the sched_feat() 'LOAD' cannot matter,
  therefore this barries doesn't add any order over the previous rmb.
 
 I added it before noticing that there was an rmb() above :-)
 
  
  And again, you need two loads for an RMB to make sense, and a WMB (or
  MB) on the other side with some STORE.
  
  I think this refers to the store of push_csd_pending at the tail of
  try_to_push_task(), but as there's no order there, there cannot be any
  here either.
 
 Right, but it's more of making sure that the state of the system is
 correct at that point than a correctness thing.

-ENOPARSE, there appear to be two different definitions of correct going
about in the above sentence.

Outside of arch code (and rarely there) should we use barriers for
anything other than ordering guarantees. And there is no 'order' in a
single variable.

   +
   + /* If a push ipi is out, then we must do the old method */
   + 

Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 06:11:10PM +0100, Peter Zijlstra wrote:
 
 #define IPS_BUSY  0x01
 #define IPS_LOOPED0x02
 
 struct ipi_pull_struct {
   struct irq_work work;
   raw_spinlock_t  lock;
   int flags;
   int dst_cpu;/* CPU that issued this search */
   int dst_prio;   /* Prio of the originating CPU */
   int src_cpu;/* CPU where the IPI is running */
   int stop_cpu;   /* cpu where we can stop iterating */
   ...
 };
 
 Where you can increase seq every time you change it and the IPI will
 continue at least one more rto mask round for every change?

seq got lost.. ignore that bit, look at the code.

 int ips_next(struct ipi_pull_struct *ips)
 {
   int cpu = ips-src_cpu;
   cpu = cpumask_next(cpu, rto_mask);
   if (cpu = nr_cpu_ids) {
   cpu = 0;

should be -1

   ips-flags |= IPS_LOOPED;
   cpu = cpumask_next(cpu, rto_mask);
   if (cpu = nr_cpu_ids) /* empty mask *;
   return cpu;
   }
   if (ips-flags  IPS_LOOPED  cpu = ips-stop_cpu)
   return nr_cpu_ids;
   return cpu;
 }
 
 
 
   struct ipi_pull_struct *ips = __this_cpu_ptr(ips);
 
   raw_spin_lock(ips-lock);
   if (ips-flags  IPS_BUSY) {
   /* there is an IPI active; update state */
   ips-dst_prio = current-prio;
   ips-stop_cpu = ips-src_cpu;
   ips-flags = ~IPS_LOOPED;
   } else {
   /* no IPI active, make one go */
   ips-dst_cpu = smp_processor_id();
   ips-dst_prio = current-prio;
   ips-src_cpu = ips-dst_cpu;
   ips-stop_cpu = ips-dst_cpu;
   ips-flags = IPS_BUSY;
 
   cpu = ips_next(ips);
   ips-src_cpu = cpu;
   if (cpu  nr_cpu_ids)
   irq_work_queue_on(ips-work, cpu);
   }
   raw_spin_unlock(ips-lock);

There might be a wee race vs the busy looping, seeing how src_cpu might
be checking against the old dst_prio, so we could check one too few rto
mask.

No real problem I think.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
 It can't be used for state?
 
 If one CPU writes zero, and the other CPU wants to decide if the
 system is in the state to do something, isn't a rmb() fine to use?
 
 
 CPU 1:
 
   x = 0;
   /* Tell other CPUs they can now do something */
   smp_wmb();
 
 CPU 2:
   /* Make sure we see current state of x */
   smp_rmb();
   if (x == 0)
   do_something();
 
 The above situation is not acceptable?

Acceptable is just not the word. It plain doesn't work that way.

 Otherwise, we fail to be able to do_something() when it is perfectly
 fine to do so.

Can't be helped.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
  Well, the problem with it is one of collisions. So the 'easy' solution I
  proposed would be something like:
  
  int ips_next(struct ipi_pull_struct *ips)
  {
  int cpu = ips-src_cpu;
  cpu = cpumask_next(cpu, rto_mask);
  if (cpu = nr_cpu_ids) {
 
 Do we really need to loop? Just start with the first one, and go to the
 end.
 
  cpu = 0;
  ips-flags |= IPS_LOOPED;
  cpu = cpumask_next(cpu, rto_mask);
  if (cpu = nr_cpu_ids) /* empty mask *;
  return cpu;
  }
  if (ips-flags  IPS_LOOPED  cpu = ips-stop_cpu)
  return nr_cpu_ids;
  return cpu;
  }

Yes, notice that we don't start iterating at the beginning; this in on
purpose. If we start iterating at the beginning, _every_ cpu will again
pile up on the first one.

By starting at the current cpu, each cpu will start iteration some place
else and hopefully, with a big enough system, different CPUs end up on a
different rto cpu.

  
  
  struct ipi_pull_struct *ips = __this_cpu_ptr(ips);
  
  raw_spin_lock(ips-lock);
  if (ips-flags  IPS_BUSY) {
  /* there is an IPI active; update state */
  ips-dst_prio = current-prio;
  ips-stop_cpu = ips-src_cpu;
  ips-flags = ~IPS_LOOPED;
 
 I guess the loop is needed for continuing the work, in case the
 scheduling changed?

That too.

  } else {
  /* no IPI active, make one go */
  ips-dst_cpu = smp_processor_id();
  ips-dst_prio = current-prio;
  ips-src_cpu = ips-dst_cpu;
  ips-stop_cpu = ips-dst_cpu;
  ips-flags = IPS_BUSY;
  
  cpu = ips_next(ips);
  ips-src_cpu = cpu;
  if (cpu  nr_cpu_ids)
  irq_work_queue_on(ips-work, cpu);
  }
  raw_spin_unlock(ips-lock);
 
 I'll have to spend some time comprehending this.

:-)

  Where you would simply start walking the RTO mask from the current
  position -- it also includes some restart logic, and you'd only take
  ips-lock when your ipi handler starts and when it needs to migrate to
  another cpu.
  
  This way, on big systems, there's at least some chance different CPUs
  find different targets to pull from.
 
 OK, makes sense. I can try that.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-25 Thread Peter Zijlstra
On Tue, Feb 24, 2015 at 01:39:46PM -0500, Steven Rostedt wrote:
 Index: linux-rt.git/kernel/sched/rt.c
 ===
 --- linux-rt.git.orig/kernel/sched/rt.c   2015-02-24 10:44:08.798785452 
 -0500
 +++ linux-rt.git/kernel/sched/rt.c2015-02-24 13:07:20.154735448 -0500

 @@ -1760,11 +1771,171 @@ static void push_rt_tasks(struct rq *rq)
   ;
  }
  
 +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
 +{
 + /* We should never be here if the IPI is already out. */
 + BUG_ON(rq-rt.push_csd_pending);
 +
 + rq-rt.push_csd_pending = 1;
 + rq-rt.push_csd_cpu = cpu;
 + /* Save the prio that we used to find this CPU */
 + rq-rt.push_csd_prio = prio;
 +
 + /* Make sure csd_cpu is visible before we send the ipi */
 + smp_mb();

We've architecturally defined the raising of interrupts vs the execution
of the handler to be a serializing event, therefore this full barrier is
not required.

I think we documented it in places, but I never can find it.

 +
 + smp_call_function_single_async(cpu, rq-rt.push_csd);

I'm confused why are you mixing smp_call_function_single_async() and
irq_work_queue_on() in the same logic?

Pick one and stick to it.

Also: 
lkml.kernel.org/r/ca+55afz492bzlfhdbkn-hygjcreup7cjmeyk3ntsfrwjppz...@mail.gmail.com

Now I would suggest you use irq_wok_queue_on() for this, consistently,
because irq_works are ran after the smp function calls complete and
therefore minimize the latency for people waiting on the (sync) smp
function calls.

 +}
 +
 +static void try_to_push_tasks(void *arg)
 +{
 + struct rt_rq *rt_rq = arg;
 + struct rq *rq, *next_rq;
 + int next_cpu = -1;
 + int next_prio = MAX_PRIO + 1;
 + int this_prio;
 + int src_prio;
 + int prio;
 + int this_cpu;
 + int success;
 + int cpu;
 +
 + /* Make sure we can see csd_cpu */
 + smp_rmb();

As per the above, interrupt are serializing, this is not needed.

 +
 + this_cpu = rt_rq-push_csd_cpu;
 +
 + /* Paranoid check */
 + BUG_ON(this_cpu != smp_processor_id());
 +
 + rq = cpu_rq(this_cpu);
 +
 + /*
 +  * If there's nothing to push here, then see if another queue
 +  * can push instead.
 +  */
 + if (!has_pushable_tasks(rq))
 + goto pass_the_ipi;
 +
 + raw_spin_lock(rq-lock);
 + success = push_rt_task(rq);
 + raw_spin_unlock(rq-lock);
 +
 + if (success)
 + goto done;
 +
 + /* Nothing was pushed, try another queue */
 +pass_the_ipi:
 +
 + /*
 +  * We use the priority that determined to send to this CPU
 +  * even if the priority for this CPU changed. This is used
 +  * to determine what other CPUs to send to, to keep from
 +  * doing a ping pong from each CPU.
 +  */
 + this_prio = rt_rq-push_csd_prio;
 + src_prio = rt_rq-highest_prio.curr;

Should we, at this point, not check if the condition that triggered the
pull request is still valid on our src cpu? No point in continuing our
IPI chain if the CPU we're looking for work for is no longer interested
in it.

 + for_each_cpu(cpu, rq-rd-rto_mask) {
 + if (this_cpu == cpu)
 + continue;
 +
 + /*
 +  * This function was called because some rq lowered its
 +  * priority. It then searched for the highest priority
 +  * rq that had overloaded tasks and sent an smp function
 +  * call to that cpu to call this function to push its
 +  * tasks. But when it got here, the task was either
 +  * already pushed, or due to affinity, could not move
 +  * the overloaded task.
 +  *
 +  * Now we need to see if there's another overloaded rq that
 +  * has an RT task that can migrate to that CPU.
 +  *
 +  * We need to be careful, we do not want to cause a ping
 +  * pong between this CPU and another CPU that has an RT task
 +  * that can migrate, but not to the CPU that lowered its
 +  * priority. Since the lowering priority CPU finds the highest
 +  * priority rq to send to, we will ignore any rq that is of 
 higher
 +  * priority than this current one. 

Maybe start a new paragraph here?

   That is, if a rq scheduled a
 +  * task of higher priority, the schedule itself would do the
 +  * push or pull then. We can safely ignore higher priority rqs.
 +  * And if there's one that is the same priority, since the CPUS
 +  * are searched in order we will ignore CPUS of the same 
 priority
 +  * unless the CPU number is greater than this CPU's number.
 +  */

I would s/CPUS/CPUs/ the plural is not part of the acronym is it?


 + next_rq = cpu_rq(cpu);
 +
 + /* Use a single read for the 

Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-24 Thread Hillf Danton
> +static void try_to_push_tasks(void *arg)
> +{
> + struct rt_rq *rt_rq = arg;
> + struct rq *rq, *next_rq;
> + int next_cpu = -1;
> + int next_prio = MAX_PRIO + 1;
> + int this_prio;
> + int src_prio;
> + int prio;
> + int this_cpu;
> + int success;
> + int cpu;
> +
> + /* Make sure we can see csd_cpu */
> + smp_rmb();
> +
> + this_cpu = rt_rq->push_csd_cpu;
> +
> + /* Paranoid check */
> + BUG_ON(this_cpu != smp_processor_id());
> +
> + rq = cpu_rq(this_cpu);
> +
> + /*
> +  * If there's nothing to push here, then see if another queue
> +  * can push instead.
> +  */
> + if (!has_pushable_tasks(rq))
> + goto pass_the_ipi;
> +
> + raw_spin_lock(>lock);
> + success = push_rt_task(rq);
> + raw_spin_unlock(>lock);
> +
> + if (success)
> + goto done;

The latency, 150us over a 20 hour run, goes up if we goto done directly?
Hillf
> +
> + /* Nothing was pushed, try another queue */
> +pass_the_ipi:
> +
> + /*
> +  * We use the priority that determined to send to this CPU
> +  * even if the priority for this CPU changed. This is used
> +  * to determine what other CPUs to send to, to keep from
> +  * doing a ping pong from each CPU.
> +  */
> + this_prio = rt_rq->push_csd_prio;
> + src_prio = rt_rq->highest_prio.curr;
> +
> + for_each_cpu(cpu, rq->rd->rto_mask) {
> + if (this_cpu == cpu)
> + continue;
> +
> + /*
> +  * This function was called because some rq lowered its
> +  * priority. It then searched for the highest priority
> +  * rq that had overloaded tasks and sent an smp function
> +  * call to that cpu to call this function to push its
> +  * tasks. But when it got here, the task was either
> +  * already pushed, or due to affinity, could not move
> +  * the overloaded task.
> +  *
> +  * Now we need to see if there's another overloaded rq that
> +  * has an RT task that can migrate to that CPU.
> +  *
> +  * We need to be careful, we do not want to cause a ping
> +  * pong between this CPU and another CPU that has an RT task
> +  * that can migrate, but not to the CPU that lowered its
> +  * priority. Since the lowering priority CPU finds the highest
> +  * priority rq to send to, we will ignore any rq that is of 
> higher
> +  * priority than this current one. That is, if a rq scheduled a
> +  * task of higher priority, the schedule itself would do the
> +  * push or pull then. We can safely ignore higher priority rqs.
> +  * And if there's one that is the same priority, since the CPUS
> +  * are searched in order we will ignore CPUS of the same 
> priority
> +  * unless the CPU number is greater than this CPU's number.
> +  */
> + next_rq = cpu_rq(cpu);
> +
> + /* Use a single read for the next prio for decision making */
> + prio = READ_ONCE(next_rq->rt.highest_prio.next);
> +
> + /* Looking for highest priority */
> + if (prio >= next_prio)
> + continue;
> +
> + /* Make sure that the rq can push to the source rq */
> + if (prio >= src_prio)
> + continue;
> +
> + /* If the prio is higher than the current prio, ignore it */
> + if (prio < this_prio)
> + continue;
> +
> + /*
> +  * If the prio is equal to the current prio, only use it
> +  * if the cpu number is greater than the current cpu.
> +  * This prevents a ping pong effect.
> +  */
> + if (prio == this_prio && cpu < this_cpu)
> + continue;
> +
> + next_prio = prio;
> + next_cpu = cpu;
> + }
> +
> + /* Nothing found, do nothing */
> + if (next_cpu < 0)
> + goto done;
> +
> + /*
> +  * Now we can not send another smp async function due to locking,
> +  * use irq_work instead.
> +  */
> +
> + rt_rq->push_csd_cpu = next_cpu;
> + rt_rq->push_csd_prio = next_prio;
> +
> + /* Make sure the next cpu is seen on remote CPU */
> + smp_mb();
> +
> + irq_work_queue_on(_rq->push_csd_work, next_cpu);
> +
> + return;
> +
> +done:
> + rt_rq->push_csd_pending = 0;
> +
> + /* Now make sure the src CPU can see this update */
> + smp_wmb();
> +}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-24 Thread Steven Rostedt
On Tue, 24 Feb 2015 13:39:46 -0500
Steven Rostedt  wrote:

> @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
>*/
>   smp_rmb();
>  
> + /* Use local just in case a feature is switched in the middle of this */
> + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
> + /* Make sure the update to pending is visible here. */
> + smp_rmb();

While porting this to the -rt kernel, I noticed that this rmb is
unneeded. It's already called above for a different reason :-p

-- Steve

> +
> + /* If a push ipi is out, then we must do the old method */
> + push_pending = READ_ONCE(this_rq->rt.push_csd_pending);
> + }
> +
>   for_each_cpu(cpu, this_rq->rd->rto_mask) {
>   if (this_cpu == cpu)
>   continue;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-24 Thread Steven Rostedt
On Tue, 24 Feb 2015 13:39:46 -0500
Steven Rostedt  wrote:
  
> +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
> +{
> + /* We should never be here if the IPI is already out. */
> + BUG_ON(rq->rt.push_csd_pending);
> +
> + rq->rt.push_csd_pending = 1;
> + rq->rt.push_csd_cpu = cpu;
> + /* Save the prio that we used to find this CPU */
> + rq->rt.push_csd_prio = prio;
> +
> + /* Make sure csd_cpu is visible before we send the ipi */
> + smp_mb();
> +
> + smp_call_function_single_async(cpu, >rt.push_csd);
> +}



> + rt_rq->push_csd_cpu = next_cpu;
> + rt_rq->push_csd_prio = next_prio;
> +
> + /* Make sure the next cpu is seen on remote CPU */
> + smp_mb();

BTW, would a smp_wmb() be enough here. I was a little paranoid about
the irq work and smp_call_functon above some how leaking before the
barrier. I'm probably just being too paranoid, and these can be
switched to smp_wmb(). But for now I'll just keep being a freak.

-- Steve


> +
> + irq_work_queue_on(_rq->push_csd_work, next_cpu);
> +
> + return;
> +

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-24 Thread Steven Rostedt
On Tue, 24 Feb 2015 13:39:46 -0500
Steven Rostedt rost...@goodmis.org wrote:

 @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
*/
   smp_rmb();
  
 + /* Use local just in case a feature is switched in the middle of this */
 + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
 + /* Make sure the update to pending is visible here. */
 + smp_rmb();

While porting this to the -rt kernel, I noticed that this rmb is
unneeded. It's already called above for a different reason :-p

-- Steve

 +
 + /* If a push ipi is out, then we must do the old method */
 + push_pending = READ_ONCE(this_rq-rt.push_csd_pending);
 + }
 +
   for_each_cpu(cpu, this_rq-rd-rto_mask) {
   if (this_cpu == cpu)
   continue;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-24 Thread Steven Rostedt
On Tue, 24 Feb 2015 13:39:46 -0500
Steven Rostedt rost...@goodmis.org wrote:
  
 +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
 +{
 + /* We should never be here if the IPI is already out. */
 + BUG_ON(rq-rt.push_csd_pending);
 +
 + rq-rt.push_csd_pending = 1;
 + rq-rt.push_csd_cpu = cpu;
 + /* Save the prio that we used to find this CPU */
 + rq-rt.push_csd_prio = prio;
 +
 + /* Make sure csd_cpu is visible before we send the ipi */
 + smp_mb();
 +
 + smp_call_function_single_async(cpu, rq-rt.push_csd);
 +}



 + rt_rq-push_csd_cpu = next_cpu;
 + rt_rq-push_csd_prio = next_prio;
 +
 + /* Make sure the next cpu is seen on remote CPU */
 + smp_mb();

BTW, would a smp_wmb() be enough here. I was a little paranoid about
the irq work and smp_call_functon above some how leaking before the
barrier. I'm probably just being too paranoid, and these can be
switched to smp_wmb(). But for now I'll just keep being a freak.

-- Steve


 +
 + irq_work_queue_on(rt_rq-push_csd_work, next_cpu);
 +
 + return;
 +

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2015-02-24 Thread Hillf Danton
 +static void try_to_push_tasks(void *arg)
 +{
 + struct rt_rq *rt_rq = arg;
 + struct rq *rq, *next_rq;
 + int next_cpu = -1;
 + int next_prio = MAX_PRIO + 1;
 + int this_prio;
 + int src_prio;
 + int prio;
 + int this_cpu;
 + int success;
 + int cpu;
 +
 + /* Make sure we can see csd_cpu */
 + smp_rmb();
 +
 + this_cpu = rt_rq-push_csd_cpu;
 +
 + /* Paranoid check */
 + BUG_ON(this_cpu != smp_processor_id());
 +
 + rq = cpu_rq(this_cpu);
 +
 + /*
 +  * If there's nothing to push here, then see if another queue
 +  * can push instead.
 +  */
 + if (!has_pushable_tasks(rq))
 + goto pass_the_ipi;
 +
 + raw_spin_lock(rq-lock);
 + success = push_rt_task(rq);
 + raw_spin_unlock(rq-lock);
 +
 + if (success)
 + goto done;

The latency, 150us over a 20 hour run, goes up if we goto done directly?
Hillf
 +
 + /* Nothing was pushed, try another queue */
 +pass_the_ipi:
 +
 + /*
 +  * We use the priority that determined to send to this CPU
 +  * even if the priority for this CPU changed. This is used
 +  * to determine what other CPUs to send to, to keep from
 +  * doing a ping pong from each CPU.
 +  */
 + this_prio = rt_rq-push_csd_prio;
 + src_prio = rt_rq-highest_prio.curr;
 +
 + for_each_cpu(cpu, rq-rd-rto_mask) {
 + if (this_cpu == cpu)
 + continue;
 +
 + /*
 +  * This function was called because some rq lowered its
 +  * priority. It then searched for the highest priority
 +  * rq that had overloaded tasks and sent an smp function
 +  * call to that cpu to call this function to push its
 +  * tasks. But when it got here, the task was either
 +  * already pushed, or due to affinity, could not move
 +  * the overloaded task.
 +  *
 +  * Now we need to see if there's another overloaded rq that
 +  * has an RT task that can migrate to that CPU.
 +  *
 +  * We need to be careful, we do not want to cause a ping
 +  * pong between this CPU and another CPU that has an RT task
 +  * that can migrate, but not to the CPU that lowered its
 +  * priority. Since the lowering priority CPU finds the highest
 +  * priority rq to send to, we will ignore any rq that is of 
 higher
 +  * priority than this current one. That is, if a rq scheduled a
 +  * task of higher priority, the schedule itself would do the
 +  * push or pull then. We can safely ignore higher priority rqs.
 +  * And if there's one that is the same priority, since the CPUS
 +  * are searched in order we will ignore CPUS of the same 
 priority
 +  * unless the CPU number is greater than this CPU's number.
 +  */
 + next_rq = cpu_rq(cpu);
 +
 + /* Use a single read for the next prio for decision making */
 + prio = READ_ONCE(next_rq-rt.highest_prio.next);
 +
 + /* Looking for highest priority */
 + if (prio = next_prio)
 + continue;
 +
 + /* Make sure that the rq can push to the source rq */
 + if (prio = src_prio)
 + continue;
 +
 + /* If the prio is higher than the current prio, ignore it */
 + if (prio  this_prio)
 + continue;
 +
 + /*
 +  * If the prio is equal to the current prio, only use it
 +  * if the cpu number is greater than the current cpu.
 +  * This prevents a ping pong effect.
 +  */
 + if (prio == this_prio  cpu  this_cpu)
 + continue;
 +
 + next_prio = prio;
 + next_cpu = cpu;
 + }
 +
 + /* Nothing found, do nothing */
 + if (next_cpu  0)
 + goto done;
 +
 + /*
 +  * Now we can not send another smp async function due to locking,
 +  * use irq_work instead.
 +  */
 +
 + rt_rq-push_csd_cpu = next_cpu;
 + rt_rq-push_csd_prio = next_prio;
 +
 + /* Make sure the next cpu is seen on remote CPU */
 + smp_mb();
 +
 + irq_work_queue_on(rt_rq-push_csd_work, next_cpu);
 +
 + return;
 +
 +done:
 + rt_rq-push_csd_pending = 0;
 +
 + /* Now make sure the src CPU can see this update */
 + smp_wmb();
 +}

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/