Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 16:16, Tejun Heo wrote:
> On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
>>> > > If the revert isn't easy, I think backporting rcu_sync is the best bet.
>> > 
>> > I leave this to Paul and Tejun... at least I think this is not v4.2 
>> > material.
> Will route reverts through cgroup branch.  Should be pretty painless.
> Nice job on percpu_rwsem.  It's so much better than having to come up
> with a separate middleground solution.

Thanks all for the quick resolution!

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
> > If the revert isn't easy, I think backporting rcu_sync is the best bet.
> 
> I leave this to Paul and Tejun... at least I think this is not v4.2 material.

Will route reverts through cgroup branch.  Should be pretty painless.
Nice job on percpu_rwsem.  It's so much better than having to come up
with a separate middleground solution.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 14:22 schrieb Oleg Nesterov:
>>  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.
> And percpu_up_write() doesn't do synchronize_sched() at all.
> 
>> it only
>> makes it more rare.
> 
> Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
> a single synchronize_sched().

And in fact it does. Paolo suggested to trace how often we call 
synchronize_sched so I applied some advanced printk debugging technology ;-)
Until login I have 41 and after starting the 70 guests this went up to 48.
Nice work.

> 
>> So it's possible that it isn't eliminating the root
>> cause of the problem.
> 
> We will see... Just in case, currently the usage of percpu_down_write()
> is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
> this needs some WIP changes in rcu_sync. Plus we can do more improvements,
> but this is off-topic right now.
> 
> Oleg.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Oleg Nesterov
On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 14:22, Oleg Nesterov wrote:
> > >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> >
> > Yes, but it eliminates _expedited(). This is good, but otoh this means
> > that (say) individual __cgroup_procs_write() can take much more time.
> > However, it won't block the readers and/or disturb the whole system.
>
> According to Christian, removing the _expedited() "makes things worse"

Yes sure, we can not just remove _expedited() from down/up_read().

> in that the system takes ages to boot up and systemd timeouts.

Yes, this is clear

> So I'm
> still a bit wary about anything that uses RCU for the cgroups write side.
>
> However, rcu_sync is okay with him, so perhaps it is really really
> effective.  Christian, can you instrument how many synchronize_sched
> (out of the 6479 cgroup_procs_write calls) are actually executed at boot
> with the rcu rework?

Heh, another change I have in mind. It would be nice to add some trace
points. But firstly we should merge the current code.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 14:22, Oleg Nesterov wrote:
> >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.

According to Christian, removing the _expedited() "makes things worse"
in that the system takes ages to boot up and systemd timeouts.  So I'm
still a bit wary about anything that uses RCU for the cgroups write side.

However, rcu_sync is okay with him, so perhaps it is really really
effective.  Christian, can you instrument how many synchronize_sched
(out of the 6479 cgroup_procs_write calls) are actually executed at boot
with the rcu rework?

Paolo

> And percpu_up_write() doesn't do synchronize_sched() at all.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Oleg Nesterov
On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 10:57, Christian Borntraeger wrote:
> > Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
> >>
> >>
> >> On 15/09/2015 19:38, Paul E. McKenney wrote:
> >>> Excellent points!
> >>>
> >>> Other options in such situations include the following:
> >>>
> >>> o Rework so that the code uses call_rcu*() instead of *_expedited().
> >>>
> >>> o Maintain a per-task or per-CPU counter so that every so many
> >>>   *_expedited() invocations instead uses the non-expedited
> >>>   counterpart.  (For example, synchronize_rcu instead of
> >>>   synchronize_rcu_expedited().)
> >>
> >> Or just use ratelimit (untested):
> >
> > One of my tests was to always replace synchronize_sched_expedited with
> > synchronize_sched and things turned out to be even worse. Not sure if
> > it makes sense to test yopur in-the-middle approach?
>
> I don't think it applies here, since down_write/up_write is a
> synchronous API.
>
> If the revert isn't easy, I think backporting rcu_sync is the best bet.

I leave this to Paul and Tejun... at least I think this is not v4.2 material.

>  The issue is that rcu_sync doesn't eliminate synchronize_sched,

Yes, but it eliminates _expedited(). This is good, but otoh this means
that (say) individual __cgroup_procs_write() can take much more time.
However, it won't block the readers and/or disturb the whole system.
And percpu_up_write() doesn't do synchronize_sched() at all.

> it only
> makes it more rare.

Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
a single synchronize_sched().

> So it's possible that it isn't eliminating the root
> cause of the problem.

We will see... Just in case, currently the usage of percpu_down_write()
is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
this needs some WIP changes in rcu_sync. Plus we can do more improvements,
but this is off-topic right now.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 13:03 schrieb Tejun Heo:
> Hello,
> 
> On Wed, Sep 16, 2015 at 12:58:00PM +0200, Christian Borntraeger wrote:
>> FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
>> just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
>> systemd seems to do that for the processes. 
>>
>> So a revert is really the right thing to do. In fact, I dont know if the
>> rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
>> be NOT a cold/seldom case.
> 
> Booting would usually be the hottest operation for that and it's still
> *relatively* cold path compared to the reader side which is task
> fork/exit paths.  The whole point is shift overhead from hotter reader
> side.  Can you see problems with percpu_rwsem rework?

As I said, it seems the rcu tree with that change seems to work fine on my
system. This needs a more testing on other machines, though. I guess a 
revert plus a re-add in the 4.4 merge window should give us enough test
coverage.

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 09:35:47PM -0700, Paul E. McKenney wrote:
> > > I am suggesting trying the options and seeing what works best, then
> > > working to convince people as needed.
> > 
> > Yeah, sure thing.  Let's wait for Christian.
> 
> Indeed.  Is there enough benefit to risk jamming this thing into 4.3?
> I believe that 4.4 should be a no-brainer.

In that case, I'm gonna revert the threadcgroup percpu_rwsem
conversion patches through -stable and reapply them for 4.4 merge
window.

Thanks!

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Wed, Sep 16, 2015 at 12:58:00PM +0200, Christian Borntraeger wrote:
> FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
> just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
> systemd seems to do that for the processes. 
>
> So a revert is really the right thing to do. In fact, I dont know if the
> rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
> be NOT a cold/seldom case.

Booting would usually be the hottest operation for that and it's still
*relatively* cold path compared to the reader side which is task
fork/exit paths.  The whole point is shift overhead from hotter reader
side.  Can you see problems with percpu_rwsem rework?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 09:44 schrieb Christian Borntraeger:
> Am 16.09.2015 um 03:24 schrieb Tejun Heo:
>> Hello, Paul.
>>
>> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
>>> Well, the decision as to what is too big for -stable is owned by the
>>> -stable maintainers, not by me.
>>
>> Is it tho?  Usually the subsystem maintainer knows the best and has
>> most say in it.  I was mostly curious whether you'd think that the
>> changes would be too risky.  If not, great.
>>
>>> I am suggesting trying the options and seeing what works best, then
>>> working to convince people as needed.
>>
>> Yeah, sure thing.  Let's wait for Christian.
> 
> Well, I have optimized my testcase now that is puts enough pressure to
> the system to  confuses system (the older 209 version, which still has
> some event loop issues) that systemd restarts the journal deamon and does
> several other recoveries.
> To avoid regressions - even for somewhat shaky userspaces - we should
> consider a revert for 4.2 stable.
> There are several followup patches, which makes the revert non-trivial,
> though.
> 
> The rework of the percpu rwsem seems to work fine, but we are beyond the
> merge window so 4.4 seems better to me. (and consider a revert for 4.3)

FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
systemd seems to do that for the processes. 

So a revert is really the right thing to do. In fact, I dont know if the
rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
be NOT a cold/seldom case.

Christian







--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:57, Christian Borntraeger wrote:
> Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
>>
>>
>> On 15/09/2015 19:38, Paul E. McKenney wrote:
>>> Excellent points!
>>>
>>> Other options in such situations include the following:
>>>
>>> o   Rework so that the code uses call_rcu*() instead of *_expedited().
>>>
>>> o   Maintain a per-task or per-CPU counter so that every so many
>>> *_expedited() invocations instead uses the non-expedited
>>> counterpart.  (For example, synchronize_rcu instead of
>>> synchronize_rcu_expedited().)
>>
>> Or just use ratelimit (untested):
> 
> One of my tests was to always replace synchronize_sched_expedited with 
> synchronize_sched and things turned out to be even worse. Not sure if
> it makes sense to test yopur in-the-middle approach?

I don't think it applies here, since down_write/up_write is a
synchronous API.

If the revert isn't easy, I think backporting rcu_sync is the best bet.
 The issue is that rcu_sync doesn't eliminate synchronize_sched, it only
makes it more rare.  So it's possible that it isn't eliminating the root
cause of the problem.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
> 
> 
> On 15/09/2015 19:38, Paul E. McKenney wrote:
>> Excellent points!
>>
>> Other options in such situations include the following:
>>
>> oRework so that the code uses call_rcu*() instead of *_expedited().
>>
>> oMaintain a per-task or per-CPU counter so that every so many
>>  *_expedited() invocations instead uses the non-expedited
>>  counterpart.  (For example, synchronize_rcu instead of
>>  synchronize_rcu_expedited().)
> 
> Or just use ratelimit (untested):

One of my tests was to always replace synchronize_sched_expedited with 
synchronize_sched and things turned out to be even worse. Not sure if
it makes sense to test yopur in-the-middle approach?

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 15/09/2015 19:38, Paul E. McKenney wrote:
> Excellent points!
> 
> Other options in such situations include the following:
> 
> o Rework so that the code uses call_rcu*() instead of *_expedited().
> 
> o Maintain a per-task or per-CPU counter so that every so many
>   *_expedited() invocations instead uses the non-expedited
>   counterpart.  (For example, synchronize_rcu instead of
>   synchronize_rcu_expedited().)

Or just use ratelimit (untested):

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 834c4e52cb2d..8fb66b2aeed9 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct percpu_rw_semaphore {
unsigned int __percpu   *fast_read_ctr;
@@ -13,6 +14,7 @@ struct percpu_rw_semaphore {
struct rw_semaphore rw_sem;
atomic_tslow_read_ctr;
wait_queue_head_t   write_waitq;
+   struct ratelimit_state  expedited_ratelimit;
 };
 
 extern void percpu_down_read(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f32567254867..c33f8bc89384 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -20,6 +20,8 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
atomic_set(>write_ctr, 0);
atomic_set(>slow_read_ctr, 0);
init_waitqueue_head(>write_waitq);
+   /* Expedite one down_write and one up_write per second.  */
+   ratelimit_state_init(>expedited_ratelimit, HZ, 2);
return 0;
 }
 
@@ -152,7 +156,10 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
 *fast-path, it executes a full memory barrier before we return.
 *See R_W case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
+   if (__ratelimit(>expedited_ratelimit))
+   synchronize_sched_expedited();
+   else
+   synchronize_sched();
 
/* exclude other writers, and block the new readers completely */
down_write(>rw_sem);
@@ -172,7 +179,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 * Insert the barrier before the next fast-path in down_read,
 * see W_R case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
+   if (__ratelimit(>expedited_ratelimit))
+   synchronize_sched_expedited();
+   else
+   synchronize_sched();
/* the last writer unblocks update_fast_ctr() */
atomic_dec(>write_ctr);
 }


> Note that synchronize_srcu_expedited() is less troublesome than are the
> other *_expedited() functions, because synchronize_srcu_expedited() does
> not inflict OS jitter on other CPUs.

Yup, synchronize_srcu_expedited() is just a busy wait and it can
complete extremely fast if you use SRCU as a "local RCU" rather
than a "sleepable RCU".  However it doesn't apply here since you
want to avoid SRCU's 2 memory barriers per lock/unlock.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 03:24 schrieb Tejun Heo:
> Hello, Paul.
> 
> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
>> Well, the decision as to what is too big for -stable is owned by the
>> -stable maintainers, not by me.
> 
> Is it tho?  Usually the subsystem maintainer knows the best and has
> most say in it.  I was mostly curious whether you'd think that the
> changes would be too risky.  If not, great.
> 
>> I am suggesting trying the options and seeing what works best, then
>> working to convince people as needed.
> 
> Yeah, sure thing.  Let's wait for Christian.

Well, I have optimized my testcase now that is puts enough pressure to
the system to  confuses system (the older 209 version, which still has
some event loop issues) that systemd restarts the journal deamon and does
several other recoveries.
To avoid regressions - even for somewhat shaky userspaces - we should
consider a revert for 4.2 stable.
There are several followup patches, which makes the revert non-trivial,
though.

The rework of the percpu rwsem seems to work fine, but we are beyond the
merge window so 4.4 seems better to me. (and consider a revert for 4.3)

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 15/09/2015 19:38, Paul E. McKenney wrote:
> Excellent points!
> 
> Other options in such situations include the following:
> 
> o Rework so that the code uses call_rcu*() instead of *_expedited().
> 
> o Maintain a per-task or per-CPU counter so that every so many
>   *_expedited() invocations instead uses the non-expedited
>   counterpart.  (For example, synchronize_rcu instead of
>   synchronize_rcu_expedited().)

Or just use ratelimit (untested):

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 834c4e52cb2d..8fb66b2aeed9 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct percpu_rw_semaphore {
unsigned int __percpu   *fast_read_ctr;
@@ -13,6 +14,7 @@ struct percpu_rw_semaphore {
struct rw_semaphore rw_sem;
atomic_tslow_read_ctr;
wait_queue_head_t   write_waitq;
+   struct ratelimit_state  expedited_ratelimit;
 };
 
 extern void percpu_down_read(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f32567254867..c33f8bc89384 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -20,6 +20,8 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
atomic_set(>write_ctr, 0);
atomic_set(>slow_read_ctr, 0);
init_waitqueue_head(>write_waitq);
+   /* Expedite one down_write and one up_write per second.  */
+   ratelimit_state_init(>expedited_ratelimit, HZ, 2);
return 0;
 }
 
@@ -152,7 +156,10 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
 *fast-path, it executes a full memory barrier before we return.
 *See R_W case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
+   if (__ratelimit(>expedited_ratelimit))
+   synchronize_sched_expedited();
+   else
+   synchronize_sched();
 
/* exclude other writers, and block the new readers completely */
down_write(>rw_sem);
@@ -172,7 +179,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 * Insert the barrier before the next fast-path in down_read,
 * see W_R case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
+   if (__ratelimit(>expedited_ratelimit))
+   synchronize_sched_expedited();
+   else
+   synchronize_sched();
/* the last writer unblocks update_fast_ctr() */
atomic_dec(>write_ctr);
 }


> Note that synchronize_srcu_expedited() is less troublesome than are the
> other *_expedited() functions, because synchronize_srcu_expedited() does
> not inflict OS jitter on other CPUs.

Yup, synchronize_srcu_expedited() is just a busy wait and it can
complete extremely fast if you use SRCU as a "local RCU" rather
than a "sleepable RCU".  However it doesn't apply here since you
want to avoid SRCU's 2 memory barriers per lock/unlock.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
> 
> 
> On 15/09/2015 19:38, Paul E. McKenney wrote:
>> Excellent points!
>>
>> Other options in such situations include the following:
>>
>> oRework so that the code uses call_rcu*() instead of *_expedited().
>>
>> oMaintain a per-task or per-CPU counter so that every so many
>>  *_expedited() invocations instead uses the non-expedited
>>  counterpart.  (For example, synchronize_rcu instead of
>>  synchronize_rcu_expedited().)
> 
> Or just use ratelimit (untested):

One of my tests was to always replace synchronize_sched_expedited with 
synchronize_sched and things turned out to be even worse. Not sure if
it makes sense to test yopur in-the-middle approach?

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 03:24 schrieb Tejun Heo:
> Hello, Paul.
> 
> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
>> Well, the decision as to what is too big for -stable is owned by the
>> -stable maintainers, not by me.
> 
> Is it tho?  Usually the subsystem maintainer knows the best and has
> most say in it.  I was mostly curious whether you'd think that the
> changes would be too risky.  If not, great.
> 
>> I am suggesting trying the options and seeing what works best, then
>> working to convince people as needed.
> 
> Yeah, sure thing.  Let's wait for Christian.

Well, I have optimized my testcase now that is puts enough pressure to
the system to  confuses system (the older 209 version, which still has
some event loop issues) that systemd restarts the journal deamon and does
several other recoveries.
To avoid regressions - even for somewhat shaky userspaces - we should
consider a revert for 4.2 stable.
There are several followup patches, which makes the revert non-trivial,
though.

The rework of the percpu rwsem seems to work fine, but we are beyond the
merge window so 4.4 seems better to me. (and consider a revert for 4.3)

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:57, Christian Borntraeger wrote:
> Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
>>
>>
>> On 15/09/2015 19:38, Paul E. McKenney wrote:
>>> Excellent points!
>>>
>>> Other options in such situations include the following:
>>>
>>> o   Rework so that the code uses call_rcu*() instead of *_expedited().
>>>
>>> o   Maintain a per-task or per-CPU counter so that every so many
>>> *_expedited() invocations instead uses the non-expedited
>>> counterpart.  (For example, synchronize_rcu instead of
>>> synchronize_rcu_expedited().)
>>
>> Or just use ratelimit (untested):
> 
> One of my tests was to always replace synchronize_sched_expedited with 
> synchronize_sched and things turned out to be even worse. Not sure if
> it makes sense to test yopur in-the-middle approach?

I don't think it applies here, since down_write/up_write is a
synchronous API.

If the revert isn't easy, I think backporting rcu_sync is the best bet.
 The issue is that rcu_sync doesn't eliminate synchronize_sched, it only
makes it more rare.  So it's possible that it isn't eliminating the root
cause of the problem.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Wed, Sep 16, 2015 at 12:58:00PM +0200, Christian Borntraeger wrote:
> FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
> just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
> systemd seems to do that for the processes. 
>
> So a revert is really the right thing to do. In fact, I dont know if the
> rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
> be NOT a cold/seldom case.

Booting would usually be the hottest operation for that and it's still
*relatively* cold path compared to the reader side which is task
fork/exit paths.  The whole point is shift overhead from hotter reader
side.  Can you see problems with percpu_rwsem rework?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 09:44 schrieb Christian Borntraeger:
> Am 16.09.2015 um 03:24 schrieb Tejun Heo:
>> Hello, Paul.
>>
>> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
>>> Well, the decision as to what is too big for -stable is owned by the
>>> -stable maintainers, not by me.
>>
>> Is it tho?  Usually the subsystem maintainer knows the best and has
>> most say in it.  I was mostly curious whether you'd think that the
>> changes would be too risky.  If not, great.
>>
>>> I am suggesting trying the options and seeing what works best, then
>>> working to convince people as needed.
>>
>> Yeah, sure thing.  Let's wait for Christian.
> 
> Well, I have optimized my testcase now that is puts enough pressure to
> the system to  confuses system (the older 209 version, which still has
> some event loop issues) that systemd restarts the journal deamon and does
> several other recoveries.
> To avoid regressions - even for somewhat shaky userspaces - we should
> consider a revert for 4.2 stable.
> There are several followup patches, which makes the revert non-trivial,
> though.
> 
> The rework of the percpu rwsem seems to work fine, but we are beyond the
> merge window so 4.4 seems better to me. (and consider a revert for 4.3)

FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
systemd seems to do that for the processes. 

So a revert is really the right thing to do. In fact, I dont know if the
rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
be NOT a cold/seldom case.

Christian







--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 09:35:47PM -0700, Paul E. McKenney wrote:
> > > I am suggesting trying the options and seeing what works best, then
> > > working to convince people as needed.
> > 
> > Yeah, sure thing.  Let's wait for Christian.
> 
> Indeed.  Is there enough benefit to risk jamming this thing into 4.3?
> I believe that 4.4 should be a no-brainer.

In that case, I'm gonna revert the threadcgroup percpu_rwsem
conversion patches through -stable and reapply them for 4.4 merge
window.

Thanks!

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 13:03 schrieb Tejun Heo:
> Hello,
> 
> On Wed, Sep 16, 2015 at 12:58:00PM +0200, Christian Borntraeger wrote:
>> FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
>> just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
>> systemd seems to do that for the processes. 
>>
>> So a revert is really the right thing to do. In fact, I dont know if the
>> rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
>> be NOT a cold/seldom case.
> 
> Booting would usually be the hottest operation for that and it's still
> *relatively* cold path compared to the reader side which is task
> fork/exit paths.  The whole point is shift overhead from hotter reader
> side.  Can you see problems with percpu_rwsem rework?

As I said, it seems the rcu tree with that change seems to work fine on my
system. This needs a more testing on other machines, though. I guess a 
revert plus a re-add in the 4.4 merge window should give us enough test
coverage.

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Oleg Nesterov
On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 10:57, Christian Borntraeger wrote:
> > Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
> >>
> >>
> >> On 15/09/2015 19:38, Paul E. McKenney wrote:
> >>> Excellent points!
> >>>
> >>> Other options in such situations include the following:
> >>>
> >>> o Rework so that the code uses call_rcu*() instead of *_expedited().
> >>>
> >>> o Maintain a per-task or per-CPU counter so that every so many
> >>>   *_expedited() invocations instead uses the non-expedited
> >>>   counterpart.  (For example, synchronize_rcu instead of
> >>>   synchronize_rcu_expedited().)
> >>
> >> Or just use ratelimit (untested):
> >
> > One of my tests was to always replace synchronize_sched_expedited with
> > synchronize_sched and things turned out to be even worse. Not sure if
> > it makes sense to test yopur in-the-middle approach?
>
> I don't think it applies here, since down_write/up_write is a
> synchronous API.
>
> If the revert isn't easy, I think backporting rcu_sync is the best bet.

I leave this to Paul and Tejun... at least I think this is not v4.2 material.

>  The issue is that rcu_sync doesn't eliminate synchronize_sched,

Yes, but it eliminates _expedited(). This is good, but otoh this means
that (say) individual __cgroup_procs_write() can take much more time.
However, it won't block the readers and/or disturb the whole system.
And percpu_up_write() doesn't do synchronize_sched() at all.

> it only
> makes it more rare.

Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
a single synchronize_sched().

> So it's possible that it isn't eliminating the root
> cause of the problem.

We will see... Just in case, currently the usage of percpu_down_write()
is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
this needs some WIP changes in rcu_sync. Plus we can do more improvements,
but this is off-topic right now.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 14:22, Oleg Nesterov wrote:
> >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.

According to Christian, removing the _expedited() "makes things worse"
in that the system takes ages to boot up and systemd timeouts.  So I'm
still a bit wary about anything that uses RCU for the cgroups write side.

However, rcu_sync is okay with him, so perhaps it is really really
effective.  Christian, can you instrument how many synchronize_sched
(out of the 6479 cgroup_procs_write calls) are actually executed at boot
with the rcu rework?

Paolo

> And percpu_up_write() doesn't do synchronize_sched() at all.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 14:22 schrieb Oleg Nesterov:
>>  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.
> And percpu_up_write() doesn't do synchronize_sched() at all.
> 
>> it only
>> makes it more rare.
> 
> Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
> a single synchronize_sched().

And in fact it does. Paolo suggested to trace how often we call 
synchronize_sched so I applied some advanced printk debugging technology ;-)
Until login I have 41 and after starting the 70 guests this went up to 48.
Nice work.

> 
>> So it's possible that it isn't eliminating the root
>> cause of the problem.
> 
> We will see... Just in case, currently the usage of percpu_down_write()
> is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
> this needs some WIP changes in rcu_sync. Plus we can do more improvements,
> but this is off-topic right now.
> 
> Oleg.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Oleg Nesterov
On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 14:22, Oleg Nesterov wrote:
> > >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> >
> > Yes, but it eliminates _expedited(). This is good, but otoh this means
> > that (say) individual __cgroup_procs_write() can take much more time.
> > However, it won't block the readers and/or disturb the whole system.
>
> According to Christian, removing the _expedited() "makes things worse"

Yes sure, we can not just remove _expedited() from down/up_read().

> in that the system takes ages to boot up and systemd timeouts.

Yes, this is clear

> So I'm
> still a bit wary about anything that uses RCU for the cgroups write side.
>
> However, rcu_sync is okay with him, so perhaps it is really really
> effective.  Christian, can you instrument how many synchronize_sched
> (out of the 6479 cgroup_procs_write calls) are actually executed at boot
> with the rcu rework?

Heh, another change I have in mind. It would be nice to add some trace
points. But firstly we should merge the current code.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 16:16, Tejun Heo wrote:
> On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
>>> > > If the revert isn't easy, I think backporting rcu_sync is the best bet.
>> > 
>> > I leave this to Paul and Tejun... at least I think this is not v4.2 
>> > material.
> Will route reverts through cgroup branch.  Should be pretty painless.
> Nice job on percpu_rwsem.  It's so much better than having to come up
> with a separate middleground solution.

Thanks all for the quick resolution!

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
> > If the revert isn't easy, I think backporting rcu_sync is the best bet.
> 
> I leave this to Paul and Tejun... at least I think this is not v4.2 material.

Will route reverts through cgroup branch.  Should be pretty painless.
Nice job on percpu_rwsem.  It's so much better than having to come up
with a separate middleground solution.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 09:24:15PM -0400, Tejun Heo wrote:
> Hello, Paul.
> 
> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
> > Well, the decision as to what is too big for -stable is owned by the
> > -stable maintainers, not by me.
> 
> Is it tho?  Usually the subsystem maintainer knows the best and has
> most say in it.  I was mostly curious whether you'd think that the
> changes would be too risky.  If not, great.

I do hope that they would listen to what I thought about it, but at
the end of the day, it is the -stable maintainers who pull a given
patch, or don't.

> > I am suggesting trying the options and seeing what works best, then
> > working to convince people as needed.
> 
> Yeah, sure thing.  Let's wait for Christian.

Indeed.  Is there enough benefit to risk jamming this thing into 4.3?
I believe that 4.4 should be a no-brainer.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello, Paul.

On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
> Well, the decision as to what is too big for -stable is owned by the
> -stable maintainers, not by me.

Is it tho?  Usually the subsystem maintainer knows the best and has
most say in it.  I was mostly curious whether you'd think that the
changes would be too risky.  If not, great.

> I am suggesting trying the options and seeing what works best, then
> working to convince people as needed.

Yeah, sure thing.  Let's wait for Christian.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 06:28:11PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 15, 2015 at 02:38:30PM -0700, Paul E. McKenney wrote:
> > I did take a shot at adding the rcu_sync stuff during this past merge
> > window, but it did not converge quickly enough to make it.  It looks
> > quite good for the next merge window.  There have been changes in most
> > of the relevant areas, so probably best to just try them and see which
> > works best.
> 
> Heh, I'm having a bit of trouble following.  Are you saying that the
> changes would be too big for -stable?  If so, I'll send out reverts of
> the culprit patches and then reapply them for this cycle so that it
> can land together with the rcu changes in the next merge window, but
> it'd be great to find out whether the rcu changes are enough for the
> issue that Christian is seeing to go away.  If not, I'll switch to a
> different locking scheme and mark those patches w/ stable tag.

Well, the decision as to what is too big for -stable is owned by the
-stable maintainers, not by me.

I am suggesting trying the options and seeing what works best, then
working to convince people as needed.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 02:38:30PM -0700, Paul E. McKenney wrote:
> I did take a shot at adding the rcu_sync stuff during this past merge
> window, but it did not converge quickly enough to make it.  It looks
> quite good for the next merge window.  There have been changes in most
> of the relevant areas, so probably best to just try them and see which
> works best.

Heh, I'm having a bit of trouble following.  Are you saying that the
changes would be too big for -stable?  If so, I'll send out reverts of
the culprit patches and then reapply them for this cycle so that it
can land together with the rcu changes in the next merge window, but
it'd be great to find out whether the rcu changes are enough for the
issue that Christian is seeing to go away.  If not, I'll switch to a
different locking scheme and mark those patches w/ stable tag.

Thanks!

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 05:26:22PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 15, 2015 at 11:11:45PM +0200, Christian Borntraeger wrote:
> > > In fact, I would say that any userspace-controlled call to *_expedited()
> > > is a bug waiting to happen and a bad idea---because userspace can, with
> > > little effort, end up calling it in a loop.
> > 
> > Right. This also implies that we should fix this for 4.2 - I guess.
> 
> Are the percpu_rwsem changes enough?  If so, we can try to backport
> those.  If those are too risky, we can revert the patches which
> switched threadgroup lock to percpu_rwsem.

I did take a shot at adding the rcu_sync stuff during this past merge
window, but it did not converge quickly enough to make it.  It looks
quite good for the next merge window.  There have been changes in most
of the relevant areas, so probably best to just try them and see which
works best.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 11:11:45PM +0200, Christian Borntraeger wrote:
> > In fact, I would say that any userspace-controlled call to *_expedited()
> > is a bug waiting to happen and a bad idea---because userspace can, with
> > little effort, end up calling it in a loop.
> 
> Right. This also implies that we should fix this for 4.2 - I guess.

Are the percpu_rwsem changes enough?  If so, we can try to backport
those.  If those are too risky, we can revert the patches which
switched threadgroup lock to percpu_rwsem.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Christian Borntraeger
Am 15.09.2015 um 18:42 schrieb Paolo Bonzini:
> 
> 
> On 15/09/2015 15:36, Christian Borntraeger wrote:
>> I am wondering why the old code behaved in such fatal ways. Is there
>> some interaction between waiting for a reschedule in the
>> synchronize_sched writer and some fork code actually waiting for the
>> read side to get the lock together with some rescheduling going on
>> waiting for a lock that fork holds? lockdep does not give me an hints
>> so I have no clue :-(
> 
> It may just be consuming too much CPU usage.  kernel/rcu/tree.c warns
> about it:
> 
>  * if you are using synchronize_sched_expedited() in a loop, please
>  * restructure your code to batch your updates, and then use a single
>  * synchronize_sched() instead.
> 
> and you may remember that in KVM we switched from RCU to SRCU exactly to
> avoid userspace-controlled synchronize_rcu_expedited().
> 
> In fact, I would say that any userspace-controlled call to *_expedited()
> is a bug waiting to happen and a bad idea---because userspace can, with
> little effort, end up calling it in a loop.

Right. This also implies that we should fix this for 4.2 - I guess.

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 06:42:19PM +0200, Paolo Bonzini wrote:
> 
> 
> On 15/09/2015 15:36, Christian Borntraeger wrote:
> > I am wondering why the old code behaved in such fatal ways. Is there
> > some interaction between waiting for a reschedule in the
> > synchronize_sched writer and some fork code actually waiting for the
> > read side to get the lock together with some rescheduling going on
> > waiting for a lock that fork holds? lockdep does not give me an hints
> > so I have no clue :-(
> 
> It may just be consuming too much CPU usage.  kernel/rcu/tree.c warns
> about it:
> 
>  * if you are using synchronize_sched_expedited() in a loop, please
>  * restructure your code to batch your updates, and then use a single
>  * synchronize_sched() instead.
> 
> and you may remember that in KVM we switched from RCU to SRCU exactly to
> avoid userspace-controlled synchronize_rcu_expedited().
> 
> In fact, I would say that any userspace-controlled call to *_expedited()
> is a bug waiting to happen and a bad idea---because userspace can, with
> little effort, end up calling it in a loop.

Excellent points!

Other options in such situations include the following:

o   Rework so that the code uses call_rcu*() instead of *_expedited().

o   Maintain a per-task or per-CPU counter so that every so many
*_expedited() invocations instead uses the non-expedited
counterpart.  (For example, synchronize_rcu instead of
synchronize_rcu_expedited().)

Note that synchronize_srcu_expedited() is less troublesome than are the
other *_expedited() functions, because synchronize_srcu_expedited() does
not inflict OS jitter on other CPUs.  This situation is being improved,
so that the other *_expedited() functions inflict less OS jitter and
(mostly) avoid inflicting OS jitter on nohz_full CPUs and idle CPUs (the
latter being important for battery-powered systems).  In addition, the
*_expedited() functions avoid hammering CPUs with N-squared OS jitter
in response to concurrent invocation from all CPUs because multiple
concurrent *_expedited() calls will be satisfied by a single expedited
grace-period operation.  Nevertheless, as Paolo points out, it is still
necessary to exercise caution when exposing synchronous grace periods
to userspace control.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paolo Bonzini


On 15/09/2015 15:36, Christian Borntraeger wrote:
> I am wondering why the old code behaved in such fatal ways. Is there
> some interaction between waiting for a reschedule in the
> synchronize_sched writer and some fork code actually waiting for the
> read side to get the lock together with some rescheduling going on
> waiting for a lock that fork holds? lockdep does not give me an hints
> so I have no clue :-(

It may just be consuming too much CPU usage.  kernel/rcu/tree.c warns
about it:

 * if you are using synchronize_sched_expedited() in a loop, please
 * restructure your code to batch your updates, and then use a single
 * synchronize_sched() instead.

and you may remember that in KVM we switched from RCU to SRCU exactly to
avoid userspace-controlled synchronize_rcu_expedited().

In fact, I would say that any userspace-controlled call to *_expedited()
is a bug waiting to happen and a bad idea---because userspace can, with
little effort, end up calling it in a loop.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 03:36:34PM +0200, Christian Borntraeger wrote:
> >> The problem seems to be that the newly used percpu_rwsem does a
> >> rcu_synchronize_sched_expedited for all write downs/ups.
> > 
> > Can you try:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git 
> > dev.2015.09.11ab
> 
> yes, dev.2015.09.11a seems to help, thanks. Getting rid of the expedited 
> hammer was
> really helpful - I guess.

Ah, that's nice.  I mentioned this in the original patchset but
percpu_rwsem as previously implemented could be too heavy on the
writer side for this path and I was planning to implement rwsem based
lglock if this blows up.  That said, if Oleg's changes makes the issue
go away, all the better.

> > those include Oleg's rework of the percpu rwsem which should hopefully
> > improve things somewhat.
> > 
> > But yes, pounding a global lock on a big machine will always suck.
> 
> By hacking out the fast path I actually degraded percpu rwsem to a real 
> global lock, but
> things were still a lot faster. 
> I am wondering why the old code behaved in such fatal ways. Is there some 
> interaction 
> between waiting for a reschedule in the synchronize_sched writer and some 
> fork code 
> actually waiting for the read side to get the lock together with some 
> rescheduling going
> on waiting for a lock that fork holds? lockdep does not give me an hints so I 
> have no clue :-(

percpu_rwsem is a global lock.  My rough suspicion is that probably
the writer locking path was taking too long (especially if the kernel
has preemption disabled) making the writers getting backed up badly
starving the readers.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Christian Borntraeger
Am 15.09.2015 um 15:05 schrieb Peter Zijlstra:
> On Tue, Sep 15, 2015 at 02:05:14PM +0200, Christian Borntraeger wrote:
>> Tejun,
>>
>>
>> commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
>> signal_struct->group_rwsem with a global percpu_rwsem) causes some noticably
>> hickups when starting several kvm guests (which libvirt will move into 
>> cgroups
>> - each vcpu thread and each i/o thread)
>> When you now start lots of guests in parallel on a bigger system (32CPUs with
>> 2way smt in my case) the system is so busy that systemd runs into several 
>> timeouts
>> like "Did not receive a reply. Possible causes include: the remote 
>> application did
>> not send a reply, the message bus security policy blocked the reply, the 
>> reply
>> timeout expired, or the network connection was broken."
>>
>> The problem seems to be that the newly used percpu_rwsem does a
>> rcu_synchronize_sched_expedited for all write downs/ups.
> 
> Can you try:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git 
> dev.2015.09.11ab

yes, dev.2015.09.11a seems to help, thanks. Getting rid of the expedited hammer 
was
really helpful - I guess.

> 
> those include Oleg's rework of the percpu rwsem which should hopefully
> improve things somewhat.
> 
> But yes, pounding a global lock on a big machine will always suck.

By hacking out the fast path I actually degraded percpu rwsem to a real global 
lock, but
things were still a lot faster. 
I am wondering why the old code behaved in such fatal ways. Is there some 
interaction 
between waiting for a reschedule in the synchronize_sched writer and some fork 
code 
actually waiting for the read side to get the lock together with some 
rescheduling going
on waiting for a lock that fork holds? lockdep does not give me an hints so I 
have no clue :-(


Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Peter Zijlstra
On Tue, Sep 15, 2015 at 02:05:14PM +0200, Christian Borntraeger wrote:
> Tejun,
> 
> 
> commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
> signal_struct->group_rwsem with a global percpu_rwsem) causes some noticably
> hickups when starting several kvm guests (which libvirt will move into cgroups
> - each vcpu thread and each i/o thread)
> When you now start lots of guests in parallel on a bigger system (32CPUs with
> 2way smt in my case) the system is so busy that systemd runs into several 
> timeouts
> like "Did not receive a reply. Possible causes include: the remote 
> application did
> not send a reply, the message bus security policy blocked the reply, the reply
> timeout expired, or the network connection was broken."
> 
> The problem seems to be that the newly used percpu_rwsem does a
> rcu_synchronize_sched_expedited for all write downs/ups.

Can you try:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git 
dev.2015.09.11ab

those include Oleg's rework of the percpu rwsem which should hopefully
improve things somewhat.

But yes, pounding a global lock on a big machine will always suck.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 09:24:15PM -0400, Tejun Heo wrote:
> Hello, Paul.
> 
> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
> > Well, the decision as to what is too big for -stable is owned by the
> > -stable maintainers, not by me.
> 
> Is it tho?  Usually the subsystem maintainer knows the best and has
> most say in it.  I was mostly curious whether you'd think that the
> changes would be too risky.  If not, great.

I do hope that they would listen to what I thought about it, but at
the end of the day, it is the -stable maintainers who pull a given
patch, or don't.

> > I am suggesting trying the options and seeing what works best, then
> > working to convince people as needed.
> 
> Yeah, sure thing.  Let's wait for Christian.

Indeed.  Is there enough benefit to risk jamming this thing into 4.3?
I believe that 4.4 should be a no-brainer.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello, Paul.

On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
> Well, the decision as to what is too big for -stable is owned by the
> -stable maintainers, not by me.

Is it tho?  Usually the subsystem maintainer knows the best and has
most say in it.  I was mostly curious whether you'd think that the
changes would be too risky.  If not, great.

> I am suggesting trying the options and seeing what works best, then
> working to convince people as needed.

Yeah, sure thing.  Let's wait for Christian.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 02:38:30PM -0700, Paul E. McKenney wrote:
> I did take a shot at adding the rcu_sync stuff during this past merge
> window, but it did not converge quickly enough to make it.  It looks
> quite good for the next merge window.  There have been changes in most
> of the relevant areas, so probably best to just try them and see which
> works best.

Heh, I'm having a bit of trouble following.  Are you saying that the
changes would be too big for -stable?  If so, I'll send out reverts of
the culprit patches and then reapply them for this cycle so that it
can land together with the rcu changes in the next merge window, but
it'd be great to find out whether the rcu changes are enough for the
issue that Christian is seeing to go away.  If not, I'll switch to a
different locking scheme and mark those patches w/ stable tag.

Thanks!

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 06:28:11PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 15, 2015 at 02:38:30PM -0700, Paul E. McKenney wrote:
> > I did take a shot at adding the rcu_sync stuff during this past merge
> > window, but it did not converge quickly enough to make it.  It looks
> > quite good for the next merge window.  There have been changes in most
> > of the relevant areas, so probably best to just try them and see which
> > works best.
> 
> Heh, I'm having a bit of trouble following.  Are you saying that the
> changes would be too big for -stable?  If so, I'll send out reverts of
> the culprit patches and then reapply them for this cycle so that it
> can land together with the rcu changes in the next merge window, but
> it'd be great to find out whether the rcu changes are enough for the
> issue that Christian is seeing to go away.  If not, I'll switch to a
> different locking scheme and mark those patches w/ stable tag.

Well, the decision as to what is too big for -stable is owned by the
-stable maintainers, not by me.

I am suggesting trying the options and seeing what works best, then
working to convince people as needed.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paolo Bonzini


On 15/09/2015 15:36, Christian Borntraeger wrote:
> I am wondering why the old code behaved in such fatal ways. Is there
> some interaction between waiting for a reschedule in the
> synchronize_sched writer and some fork code actually waiting for the
> read side to get the lock together with some rescheduling going on
> waiting for a lock that fork holds? lockdep does not give me an hints
> so I have no clue :-(

It may just be consuming too much CPU usage.  kernel/rcu/tree.c warns
about it:

 * if you are using synchronize_sched_expedited() in a loop, please
 * restructure your code to batch your updates, and then use a single
 * synchronize_sched() instead.

and you may remember that in KVM we switched from RCU to SRCU exactly to
avoid userspace-controlled synchronize_rcu_expedited().

In fact, I would say that any userspace-controlled call to *_expedited()
is a bug waiting to happen and a bad idea---because userspace can, with
little effort, end up calling it in a loop.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 06:42:19PM +0200, Paolo Bonzini wrote:
> 
> 
> On 15/09/2015 15:36, Christian Borntraeger wrote:
> > I am wondering why the old code behaved in such fatal ways. Is there
> > some interaction between waiting for a reschedule in the
> > synchronize_sched writer and some fork code actually waiting for the
> > read side to get the lock together with some rescheduling going on
> > waiting for a lock that fork holds? lockdep does not give me an hints
> > so I have no clue :-(
> 
> It may just be consuming too much CPU usage.  kernel/rcu/tree.c warns
> about it:
> 
>  * if you are using synchronize_sched_expedited() in a loop, please
>  * restructure your code to batch your updates, and then use a single
>  * synchronize_sched() instead.
> 
> and you may remember that in KVM we switched from RCU to SRCU exactly to
> avoid userspace-controlled synchronize_rcu_expedited().
> 
> In fact, I would say that any userspace-controlled call to *_expedited()
> is a bug waiting to happen and a bad idea---because userspace can, with
> little effort, end up calling it in a loop.

Excellent points!

Other options in such situations include the following:

o   Rework so that the code uses call_rcu*() instead of *_expedited().

o   Maintain a per-task or per-CPU counter so that every so many
*_expedited() invocations instead uses the non-expedited
counterpart.  (For example, synchronize_rcu instead of
synchronize_rcu_expedited().)

Note that synchronize_srcu_expedited() is less troublesome than are the
other *_expedited() functions, because synchronize_srcu_expedited() does
not inflict OS jitter on other CPUs.  This situation is being improved,
so that the other *_expedited() functions inflict less OS jitter and
(mostly) avoid inflicting OS jitter on nohz_full CPUs and idle CPUs (the
latter being important for battery-powered systems).  In addition, the
*_expedited() functions avoid hammering CPUs with N-squared OS jitter
in response to concurrent invocation from all CPUs because multiple
concurrent *_expedited() calls will be satisfied by a single expedited
grace-period operation.  Nevertheless, as Paolo points out, it is still
necessary to exercise caution when exposing synchronous grace periods
to userspace control.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Christian Borntraeger
Am 15.09.2015 um 18:42 schrieb Paolo Bonzini:
> 
> 
> On 15/09/2015 15:36, Christian Borntraeger wrote:
>> I am wondering why the old code behaved in such fatal ways. Is there
>> some interaction between waiting for a reschedule in the
>> synchronize_sched writer and some fork code actually waiting for the
>> read side to get the lock together with some rescheduling going on
>> waiting for a lock that fork holds? lockdep does not give me an hints
>> so I have no clue :-(
> 
> It may just be consuming too much CPU usage.  kernel/rcu/tree.c warns
> about it:
> 
>  * if you are using synchronize_sched_expedited() in a loop, please
>  * restructure your code to batch your updates, and then use a single
>  * synchronize_sched() instead.
> 
> and you may remember that in KVM we switched from RCU to SRCU exactly to
> avoid userspace-controlled synchronize_rcu_expedited().
> 
> In fact, I would say that any userspace-controlled call to *_expedited()
> is a bug waiting to happen and a bad idea---because userspace can, with
> little effort, end up calling it in a loop.

Right. This also implies that we should fix this for 4.2 - I guess.

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 11:11:45PM +0200, Christian Borntraeger wrote:
> > In fact, I would say that any userspace-controlled call to *_expedited()
> > is a bug waiting to happen and a bad idea---because userspace can, with
> > little effort, end up calling it in a loop.
> 
> Right. This also implies that we should fix this for 4.2 - I guess.

Are the percpu_rwsem changes enough?  If so, we can try to backport
those.  If those are too risky, we can revert the patches which
switched threadgroup lock to percpu_rwsem.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 05:26:22PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 15, 2015 at 11:11:45PM +0200, Christian Borntraeger wrote:
> > > In fact, I would say that any userspace-controlled call to *_expedited()
> > > is a bug waiting to happen and a bad idea---because userspace can, with
> > > little effort, end up calling it in a loop.
> > 
> > Right. This also implies that we should fix this for 4.2 - I guess.
> 
> Are the percpu_rwsem changes enough?  If so, we can try to backport
> those.  If those are too risky, we can revert the patches which
> switched threadgroup lock to percpu_rwsem.

I did take a shot at adding the rcu_sync stuff during this past merge
window, but it did not converge quickly enough to make it.  It looks
quite good for the next merge window.  There have been changes in most
of the relevant areas, so probably best to just try them and see which
works best.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Christian Borntraeger
Am 15.09.2015 um 15:05 schrieb Peter Zijlstra:
> On Tue, Sep 15, 2015 at 02:05:14PM +0200, Christian Borntraeger wrote:
>> Tejun,
>>
>>
>> commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
>> signal_struct->group_rwsem with a global percpu_rwsem) causes some noticably
>> hickups when starting several kvm guests (which libvirt will move into 
>> cgroups
>> - each vcpu thread and each i/o thread)
>> When you now start lots of guests in parallel on a bigger system (32CPUs with
>> 2way smt in my case) the system is so busy that systemd runs into several 
>> timeouts
>> like "Did not receive a reply. Possible causes include: the remote 
>> application did
>> not send a reply, the message bus security policy blocked the reply, the 
>> reply
>> timeout expired, or the network connection was broken."
>>
>> The problem seems to be that the newly used percpu_rwsem does a
>> rcu_synchronize_sched_expedited for all write downs/ups.
> 
> Can you try:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git 
> dev.2015.09.11ab

yes, dev.2015.09.11a seems to help, thanks. Getting rid of the expedited hammer 
was
really helpful - I guess.

> 
> those include Oleg's rework of the percpu rwsem which should hopefully
> improve things somewhat.
> 
> But yes, pounding a global lock on a big machine will always suck.

By hacking out the fast path I actually degraded percpu rwsem to a real global 
lock, but
things were still a lot faster. 
I am wondering why the old code behaved in such fatal ways. Is there some 
interaction 
between waiting for a reschedule in the synchronize_sched writer and some fork 
code 
actually waiting for the read side to get the lock together with some 
rescheduling going
on waiting for a lock that fork holds? lockdep does not give me an hints so I 
have no clue :-(


Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 03:36:34PM +0200, Christian Borntraeger wrote:
> >> The problem seems to be that the newly used percpu_rwsem does a
> >> rcu_synchronize_sched_expedited for all write downs/ups.
> > 
> > Can you try:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git 
> > dev.2015.09.11ab
> 
> yes, dev.2015.09.11a seems to help, thanks. Getting rid of the expedited 
> hammer was
> really helpful - I guess.

Ah, that's nice.  I mentioned this in the original patchset but
percpu_rwsem as previously implemented could be too heavy on the
writer side for this path and I was planning to implement rwsem based
lglock if this blows up.  That said, if Oleg's changes makes the issue
go away, all the better.

> > those include Oleg's rework of the percpu rwsem which should hopefully
> > improve things somewhat.
> > 
> > But yes, pounding a global lock on a big machine will always suck.
> 
> By hacking out the fast path I actually degraded percpu rwsem to a real 
> global lock, but
> things were still a lot faster. 
> I am wondering why the old code behaved in such fatal ways. Is there some 
> interaction 
> between waiting for a reschedule in the synchronize_sched writer and some 
> fork code 
> actually waiting for the read side to get the lock together with some 
> rescheduling going
> on waiting for a lock that fork holds? lockdep does not give me an hints so I 
> have no clue :-(

percpu_rwsem is a global lock.  My rough suspicion is that probably
the writer locking path was taking too long (especially if the kernel
has preemption disabled) making the writers getting backed up badly
starving the readers.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Peter Zijlstra
On Tue, Sep 15, 2015 at 02:05:14PM +0200, Christian Borntraeger wrote:
> Tejun,
> 
> 
> commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
> signal_struct->group_rwsem with a global percpu_rwsem) causes some noticably
> hickups when starting several kvm guests (which libvirt will move into cgroups
> - each vcpu thread and each i/o thread)
> When you now start lots of guests in parallel on a bigger system (32CPUs with
> 2way smt in my case) the system is so busy that systemd runs into several 
> timeouts
> like "Did not receive a reply. Possible causes include: the remote 
> application did
> not send a reply, the message bus security policy blocked the reply, the reply
> timeout expired, or the network connection was broken."
> 
> The problem seems to be that the newly used percpu_rwsem does a
> rcu_synchronize_sched_expedited for all write downs/ups.

Can you try:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git 
dev.2015.09.11ab

those include Oleg's rework of the percpu rwsem which should hopefully
improve things somewhat.

But yes, pounding a global lock on a big machine will always suck.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/