Re: [PATCH v4] ovl: fix dentry leak in ovl_get_redirect

2021-01-07 Thread Liangyan

ping

On 20/12/22 11:06 AM, Liangyan wrote:

We need to hold d_parent->d_lock before calling dget_dlock(), otherwise
d_lockref may be updated in parallel, as in the call trace below, which
leaks a dentry->d_lockref count and risks a crash.

      CPU 0                              CPU 1
ovl_set_redirect                   lookup_fast
  ovl_get_redirect                   __d_lookup
    dget_dlock
      //no lock protection here         spin_lock(&dentry->d_lock)
      dentry->d_lockref.count++         dentry->d_lockref.count++
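
(Aside, for illustration only -- not kernel code: a minimal userspace
sketch, with made-up names, of why an increment that is not serialized
by the same lock as the other updaters can be lost.  That is what the
unlocked dget_dlock(d->d_parent) above amounts to.)

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long count;			/* stands in for d_lockref.count */

/* like __d_lookup(): bump the count under the lock */
static void *locked_inc(void *arg)
{
	for (int i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&lock);
		count++;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* like dget_dlock() called without holding d_parent->d_lock */
static void *unlocked_inc(void *arg)
{
	for (int i = 0; i < 1000000; i++)
		count++;		/* races with the locked updater */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, locked_inc, NULL);
	pthread_create(&b, NULL, unlocked_inc, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* Usually prints less than 2000000: some increments were lost. */
	printf("count = %ld\n", count);
	return 0;
}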

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 
#1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 
04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.803470] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.803949] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.804600] RDX: 0001 RSI: 000a RDI: 0088
[   49.805252] RBP:  R08:  R09: 993cf040
[   49.805898] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.806548] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.807200] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.807935] CS:  0010 DS:  ES:  CR0: 80050033
[   49.808461] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.809113] DR0:  DR1:  DR2: 
[   49.809758] DR3:  DR6: fffe0ff0 DR7: 0400
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[   49.814404] RSP: 002b:7ffbedffada8 EFLAGS: 0297 ORIG_RAX: 
0054
[   49.815098] RAX: ffda RBX: 7ffbedffb640 RCX: 7ffbf505ffd7
[   49.815744] RDX: 04449700 RSI:  RDI: 06c8cd50
[   49.816394] RBP: 7ffbedffaea0 R08:  R09: 00017d0b
[   49.817038] R10:  R11: 0297 R12: 0012
[   49.817687] R13: 072823d8 R14: 7ffbedffb700 R15: 072823d8
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd 
libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.821880] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.822363] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.823008] RDX: 0001 RSI: 000a RDI: 0088
[   49.823658] RBP:  R08:  R09: 993cf040
[   49.825404] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.827147] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.828890] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.830725] CS:  0010 DS:  ES:  CR0: 80050033
[   49.832359] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.834085] DR0:  DR1:  DR2: 
[   49.835792] DR3:  DR6: fffe0ff0 DR7: 0400

Cc: 
Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan 
Reviewed-by: Joseph Qi 
Suggested-by: Al Viro 
---
  fs/overlayfs/dir.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..d1efa3a5a503 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -992,8 +992,8 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
-		tmp = dget_dlock(d->d_parent);
 		spin_unlock(&d->d_lock);
+		tmp = dget_parent(d);
 
 		dput(d);
 		d = tmp;


Re: [PATCH v4] ovl: fix dentry leak in ovl_get_redirect

2020-12-22 Thread Liangyan

Thanks Viro.

@Miklos, can you please advise?

On 20/12/22 11:26 AM, Al Viro wrote:

On Tue, Dec 22, 2020 at 11:06:26AM +0800, Liangyan wrote:


Cc: 
Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan 
Reviewed-by: Joseph Qi 
Suggested-by: Al Viro 


Fine by me...  I can put it through vfs.git#fixes, but IMO
that would be better off in overlayfs tree.



[PATCH v4] ovl: fix dentry leak in ovl_get_redirect

2020-12-21 Thread Liangyan
We need to hold d_parent->d_lock before calling dget_dlock(), otherwise
d_lockref may be updated in parallel, as in the call trace below, which
leaks a dentry->d_lockref count and risks a crash.

      CPU 0                              CPU 1
ovl_set_redirect                   lookup_fast
  ovl_get_redirect                   __d_lookup
    dget_dlock
      //no lock protection here         spin_lock(&dentry->d_lock)
      dentry->d_lockref.count++         dentry->d_lockref.count++

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 
#1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 
04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.803470] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.803949] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.804600] RDX: 0001 RSI: 000a RDI: 0088
[   49.805252] RBP:  R08:  R09: 993cf040
[   49.805898] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.806548] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.807200] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.807935] CS:  0010 DS:  ES:  CR0: 80050033
[   49.808461] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.809113] DR0:  DR1:  DR2: 
[   49.809758] DR3:  DR6: fffe0ff0 DR7: 0400
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[   49.814404] RSP: 002b:7ffbedffada8 EFLAGS: 0297 ORIG_RAX: 
0054
[   49.815098] RAX: ffda RBX: 7ffbedffb640 RCX: 7ffbf505ffd7
[   49.815744] RDX: 04449700 RSI:  RDI: 06c8cd50
[   49.816394] RBP: 7ffbedffaea0 R08:  R09: 00017d0b
[   49.817038] R10:  R11: 0297 R12: 0012
[   49.817687] R13: 072823d8 R14: 7ffbedffb700 R15: 072823d8
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd 
libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.821880] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.822363] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.823008] RDX: 0001 RSI: 000a RDI: 0088
[   49.823658] RBP:  R08:  R09: 993cf040
[   49.825404] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.827147] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.828890] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.830725] CS:  0010 DS:  ES:  CR0: 80050033
[   49.832359] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.834085] DR0:  DR1:  DR2: 
[   49.835792] DR3:  DR6: fffe0ff0 DR7: 0400

Cc: 
Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan 
Reviewed-by: Joseph Qi 
Suggested-by: Al Viro 
---
 fs/overlayfs/dir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..d1efa3a5a503 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -992,8 +992,8 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
-		tmp = dget_dlock(d->d_parent);
 		spin_unlock(&d->d_lock);
+		tmp = dget_parent(d);
 
 		dput(d);
 		d = tmp;
-- 
2.14.4.44.g2045bb6



[PATCH v3] ovl: fix dentry leak in ovl_get_redirect

2020-12-21 Thread Liangyan
We need to hold d_parent->d_lock before calling dget_dlock(), otherwise
d_lockref may be updated in parallel, as in the call trace below, which
leaks a dentry->d_lockref count and risks a crash.

      CPU 0                              CPU 1
ovl_set_redirect                   lookup_fast
  ovl_get_redirect                   __d_lookup
    dget_dlock
      //no lock protection here         spin_lock(&dentry->d_lock)
      dentry->d_lockref.count++         dentry->d_lockref.count++

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 
#1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 
04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.803470] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.803949] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.804600] RDX: 0001 RSI: 000a RDI: 0088
[   49.805252] RBP:  R08:  R09: 993cf040
[   49.805898] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.806548] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.807200] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.807935] CS:  0010 DS:  ES:  CR0: 80050033
[   49.808461] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.809113] DR0:  DR1:  DR2: 
[   49.809758] DR3:  DR6: fffe0ff0 DR7: 0400
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[   49.814404] RSP: 002b:7ffbedffada8 EFLAGS: 0297 ORIG_RAX: 
0054
[   49.815098] RAX: ffda RBX: 7ffbedffb640 RCX: 7ffbf505ffd7
[   49.815744] RDX: 04449700 RSI:  RDI: 06c8cd50
[   49.816394] RBP: 7ffbedffaea0 R08:  R09: 00017d0b
[   49.817038] R10:  R11: 0297 R12: 0012
[   49.817687] R13: 072823d8 R14: 7ffbedffb700 R15: 072823d8
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd 
libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.821880] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.822363] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.823008] RDX: 0001 RSI: 000a RDI: 0088
[   49.823658] RBP:  R08:  R09: 993cf040
[   49.825404] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.827147] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.828890] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.830725] CS:  0010 DS:  ES:  CR0: 80050033
[   49.832359] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.834085] DR0:  DR1:  DR2: 
[   49.835792] DR3:  DR6: fffe0ff0 DR7: 0400

Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan 
Reviewed-by: Joseph Qi 
Suggested-by: Al Viro 
---
 fs/overlayfs/dir.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..e9aa4a12ad82 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,11 +993,10 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
-		tmp = dget_dlock(d->d_parent);
 		spin_unlock(&d->d_lock);
-
+		parent = dget_parent(d);
 		dput(d);
-		d = tmp;
+		d = parent;
 
 		/* Absolute redirect: finished */
 		if (buf[buflen] == '/')
-- 
2.14.4.44.g2045bb6



Re: [PATCH v2] ovl: fix dentry leak in ovl_get_redirect

2020-12-21 Thread Liangyan
Exactly, I missed this definition of d_lock and treated it as a
standalone member of struct dentry.

#define d_lock  d_lockref.lock

Thanks for the explanation. I will post a new patch following your suggestion.

Regards,
Liangyan

On 20/12/22 1:35 AM, Al Viro wrote:

On Tue, Dec 22, 2020 at 12:51:27AM +0800, Liangyan wrote:

This is the race scenario, based on the call trace we captured, that
causes the dentry leak.

      CPU 0                              CPU 1
ovl_set_redirect                   lookup_fast
  ovl_get_redirect                   __d_lookup
    dget_dlock
      //no lock protection here         spin_lock(&dentry->d_lock)
      dentry->d_lockref.count++         dentry->d_lockref.count++


If we use dget_parent instead, we may have this race.


      CPU 0                                     CPU 1
ovl_set_redirect                          lookup_fast
  ovl_get_redirect                          __d_lookup
    dget_parent
      raw_seqcount_begin(&dentry->d_seq)       spin_lock(&dentry->d_lock)
      lockref_get_not_zero(&ret->d_lockref)    dentry->d_lockref.count++


And?

lockref_get_not_zero() will observe ->d_lock held and fall back to
taking it.

The whole point of lockref is that counter and spinlock are next to each
other.  Fastpath in lockref_get_not_zero is cmpxchg on both, and
it is taken only if ->d_lock is *NOT* locked.  And the slow path
there will do spin_lock() around the manipulations of ->count.

Note that ->d_lock is simply ->d_lockref.lock; ->d_seq has nothing
to do with the whole thing.

The race in mainline is real; if you can observe anything of that
sort with dget_parent(), we have much worse problem.  Consider
dget() vs. lookup_fast() - no overlayfs weirdness in sight and the
same kind of concurrent access.

Again, lockref primitives can be safely mixed with other threads
doing operations on ->count while holding ->lock.
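
(Aside, for illustration only -- this is NOT the real <linux/lockref.h>
implementation, just a simplified sketch of the idea described above:
count and lock packed into one word, a cmpxchg fastpath attempted only
while the embedded lock reads unlocked, and a locked slowpath otherwise.
The field layout and names here are made up.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct mini_lockref {
	/* low 32 bits: lock word (0 = unlocked), high 32 bits: count */
	_Atomic uint64_t lock_count;
};

/*
 * Fastpath only: bump the count with a single cmpxchg on the combined
 * word, attempted only while the lock half reads "unlocked".  Returns
 * false when the caller must fall back to the locked slowpath (lock
 * currently held, or count already zero).
 */
static bool mini_lockref_get_not_zero_fast(struct mini_lockref *lr)
{
	uint64_t old = atomic_load(&lr->lock_count);

	while ((uint32_t)old == 0 && (old >> 32) != 0) {
		uint64_t new = old + ((uint64_t)1 << 32);	/* count++ in the high half */

		if (atomic_compare_exchange_weak(&lr->lock_count, &old, new))
			return true;	/* count bumped; the lock was never touched */
		/* cmpxchg failed: 'old' was reloaded, re-check the lock half */
	}
	return false;	/* slowpath: take the lock, re-check, increment under it */
}

int main(void)
{
	struct mini_lockref lr = { .lock_count = (uint64_t)1 << 32 };	/* count = 1, unlocked */

	printf("fastpath, unlocked: %d\n", mini_lockref_get_not_zero_fast(&lr));	/* 1 */
	lr.lock_count |= 1;	/* pretend another CPU holds the embedded lock */
	printf("fastpath, locked:   %d\n", mini_lockref_get_not_zero_fast(&lr));	/* 0 */
	return 0;
}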



Re: [PATCH v2] ovl: fix dentry leak in ovl_get_redirect

2020-12-21 Thread Liangyan
This is the race scenario, based on the call trace we captured, that
causes the dentry leak.

      CPU 0                              CPU 1
ovl_set_redirect                   lookup_fast
  ovl_get_redirect                   __d_lookup
    dget_dlock
      //no lock protection here         spin_lock(&dentry->d_lock)
      dentry->d_lockref.count++         dentry->d_lockref.count++


If we use dget_parent instead, we may have this race.


      CPU 0                                     CPU 1
ovl_set_redirect                          lookup_fast
  ovl_get_redirect                          __d_lookup
    dget_parent
      raw_seqcount_begin(&dentry->d_seq)       spin_lock(&dentry->d_lock)
      lockref_get_not_zero(&ret->d_lockref)    dentry->d_lockref.count++




On 20/12/21 8:11 PM, Al Viro wrote:

On Mon, Dec 21, 2020 at 07:14:44PM +0800, Joseph Qi wrote:

Hi Viro,

On 12/21/20 2:26 PM, Al Viro wrote:

On Sun, Dec 20, 2020 at 08:09:27PM +0800, Liangyan wrote:


+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,7 +993,22 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
-		tmp = dget_dlock(d->d_parent);
+		parent = d->d_parent;
+		if (unlikely(!spin_trylock(&parent->d_lock))) {
+			rcu_read_lock();
+			spin_unlock(&d->d_lock);
+again:
+			parent = READ_ONCE(d->d_parent);
+			spin_lock(&parent->d_lock);
+			if (unlikely(parent != d->d_parent)) {
+				spin_unlock(&parent->d_lock);
+				goto again;
+			}
+			rcu_read_unlock();
+			spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
+		}
+		tmp = dget_dlock(parent);
+		spin_unlock(&parent->d_lock);
 		spin_unlock(&d->d_lock);


Yecch...  What's wrong with just doing
	spin_unlock(&d->d_lock);
	parent = dget_parent(d);
	dput(d);
	d = parent;
instead of that?



Now the race happens on the non-RCU path in lookup_fast(); I'm afraid
d_seq cannot close the race window.


Explain, please.  What exactly are you observing?



Re: [PATCH v2] ovl: fix dentry leak in ovl_get_redirect

2020-12-20 Thread Liangyan
Guys, any comments on this patch? This issue should also exist in the
latest upstream.


On 20/12/20 8:09 PM, Liangyan wrote:

We need to hold d_parent->d_lock before calling dget_dlock(), otherwise
d_lockref may be updated in parallel, as in the call trace below, which
leaks a dentry->d_lockref count and risks a crash.

npm-20576 [028]  5705749.040094:
[28] ovl_set_redirect+0x11c/0x310 //tmp = dget_dlock(d->d_parent);
[28]?  ovl_set_redirect+0x5/0x310
[28] ovl_rename+0x4db/0x790 [overlay]
[28] vfs_rename+0x6e8/0x920
[28] do_renameat2+0x4d6/0x560
[28] __x64_sys_rename+0x1c/0x20
[28] do_syscall_64+0x55/0x1a0
[28] entry_SYSCALL_64_after_hwframe+0x44/0xa9

npm-20574 [036]  5705749.040094:
[36] __d_lookup+0x107/0x140 //dentry->d_lockref.count++;
[36] lookup_fast+0xe0/0x2d0
[36] walk_component+0x48/0x350
[36] link_path_walk+0x1bf/0x650
[36]?  path_init+0x1f6/0x2f0
[36] path_lookupat+0x82/0x210
[36] filename_lookup+0xb8/0x1a0
[36]?  __audit_getname+0xa2/0xb0
[36]?  getname_flags+0xb9/0x1e0
[36]?  vfs_statx+0x73/0xe0
[36] vfs_statx+0x73/0xe0
[36] __do_sys_statx+0x3b/0x80
[36]?  syscall_trace_enter+0x1ae/0x2c0
[36] do_syscall_64+0x55/0x1a0
[36] entry_SYSCALL_64_

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 
#1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 
04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.803470] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.803949] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.804600] RDX: 0001 RSI: 000a RDI: 0088
[   49.805252] RBP:  R08:  R09: 993cf040
[   49.805898] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.806548] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.807200] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.807935] CS:  0010 DS:  ES:  CR0: 80050033
[   49.808461] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.809113] DR0:  DR1:  DR2: 
[   49.809758] DR3:  DR6: fffe0ff0 DR7: 0400
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[   49.814404] RSP: 002b:7ffbedffada8 EFLAGS: 0297 ORIG_RAX: 
0054
[   49.815098] RAX: ffda RBX: 7ffbedffb640 RCX: 7ffbf505ffd7
[   49.815744] RDX: 04449700 RSI:  RDI: 06c8cd50
[   49.816394] RBP: 7ffbedffaea0 R08:  R09: 00017d0b
[   49.817038] R10:  R11: 0297 R12: 0012
[   49.817687] R13: 072823d8 R14: 7ffbedffb700 R15: 072823d8
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd 
libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.821880] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.822363] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.823008] RDX: 0001 RSI: 000a RDI: 0088
[   49.823658] RBP:  R08:  R09: 993cf040
[   49.825404] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.827147] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.828890] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.830725] CS:  0010 DS:  ES:  CR0: 80050033
[   49.832359] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.834085] DR0:  DR1:  DR2: 
[   49.835792] DR3:  DR6: fffe0ff0 DR7: 0400

Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan 
Suggested-by: Joseph Qi 
---
  fs/overlayfs/dir.c | 18 +++++++++++++++++-
  1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..a78d35017371 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,7 +993,22 @@ static char *ovl_get_redirect(struct dentr

[PATCH v2] ovl: fix dentry leak in ovl_get_redirect

2020-12-20 Thread Liangyan
We need to hold d_parent->d_lock before calling dget_dlock(), otherwise
d_lockref may be updated in parallel, as in the call trace below, which
leaks a dentry->d_lockref count and risks a crash.

npm-20576 [028]  5705749.040094:
[28] ovl_set_redirect+0x11c/0x310 //tmp = dget_dlock(d->d_parent);
[28]?  ovl_set_redirect+0x5/0x310
[28] ovl_rename+0x4db/0x790 [overlay]
[28] vfs_rename+0x6e8/0x920
[28] do_renameat2+0x4d6/0x560
[28] __x64_sys_rename+0x1c/0x20
[28] do_syscall_64+0x55/0x1a0
[28] entry_SYSCALL_64_after_hwframe+0x44/0xa9

npm-20574 [036]  5705749.040094:
[36] __d_lookup+0x107/0x140 //dentry->d_lockref.count++;
[36] lookup_fast+0xe0/0x2d0
[36] walk_component+0x48/0x350
[36] link_path_walk+0x1bf/0x650
[36]?  path_init+0x1f6/0x2f0
[36] path_lookupat+0x82/0x210
[36] filename_lookup+0xb8/0x1a0
[36]?  __audit_getname+0xa2/0xb0
[36]?  getname_flags+0xb9/0x1e0
[36]?  vfs_statx+0x73/0xe0
[36] vfs_statx+0x73/0xe0
[36] __do_sys_statx+0x3b/0x80
[36]?  syscall_trace_enter+0x1ae/0x2c0
[36] do_syscall_64+0x55/0x1a0
[36] entry_SYSCALL_64_

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 
#1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 
04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.803470] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.803949] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.804600] RDX: 0001 RSI: 000a RDI: 0088
[   49.805252] RBP:  R08:  R09: 993cf040
[   49.805898] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.806548] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.807200] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.807935] CS:  0010 DS:  ES:  CR0: 80050033
[   49.808461] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.809113] DR0:  DR1:  DR2: 
[   49.809758] DR3:  DR6: fffe0ff0 DR7: 0400
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[   49.814404] RSP: 002b:7ffbedffada8 EFLAGS: 0297 ORIG_RAX: 
0054
[   49.815098] RAX: ffda RBX: 7ffbedffb640 RCX: 7ffbf505ffd7
[   49.815744] RDX: 04449700 RSI:  RDI: 06c8cd50
[   49.816394] RBP: 7ffbedffaea0 R08:  R09: 00017d0b
[   49.817038] R10:  R11: 0297 R12: 0012
[   49.817687] R13: 072823d8 R14: 7ffbedffb700 R15: 072823d8
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd 
libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.821880] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.822363] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.823008] RDX: 0001 RSI: 000a RDI: 0088
[   49.823658] RBP:  R08:  R09: 993cf040
[   49.825404] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.827147] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.828890] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.830725] CS:  0010 DS:  ES:  CR0: 80050033
[   49.832359] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.834085] DR0:  DR1:  DR2: 
[   49.835792] DR3:  DR6: fffe0ff0 DR7: 0400

Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan 
Suggested-by: Joseph Qi 
---
 fs/overlayfs/dir.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..a78d35017371 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,7 +993,22 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);

[PATCH] ovl: fix dentry leak in ovl_get_redirect

2020-12-18 Thread Liangyan
We need to hold d_parent->d_lock before calling dget_dlock(), otherwise
d_lockref may be updated in parallel, as in the call trace below, which
leaks a dentry->d_lockref count and risks a crash.

npm-20576 [028]  5705749.040094:
[28] ovl_set_redirect+0x11c/0x310 //tmp = dget_dlock(d->d_parent);
[28]?  ovl_set_redirect+0x5/0x310
[28] ovl_rename+0x4db/0x790 [overlay]
[28] vfs_rename+0x6e8/0x920
[28] do_renameat2+0x4d6/0x560
[28] __x64_sys_rename+0x1c/0x20
[28] do_syscall_64+0x55/0x1a0
[28] entry_SYSCALL_64_after_hwframe+0x44/0xa9

npm-20574 [036]  5705749.040094:
[36] __d_lookup+0x107/0x140 //dentry->d_lockref.count++;
[36] lookup_fast+0xe0/0x2d0
[36] walk_component+0x48/0x350
[36] link_path_walk+0x1bf/0x650
[36]?  path_init+0x1f6/0x2f0
[36] path_lookupat+0x82/0x210
[36] filename_lookup+0xb8/0x1a0
[36]?  __audit_getname+0xa2/0xb0
[36]?  getname_flags+0xb9/0x1e0
[36]?  vfs_statx+0x73/0xe0
[36] vfs_statx+0x73/0xe0
[36] __do_sys_statx+0x3b/0x80
[36]?  syscall_trace_enter+0x1ae/0x2c0
[36] do_syscall_64+0x55/0x1a0
[36] entry_SYSCALL_64_

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 
#1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 
04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.803470] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.803949] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.804600] RDX: 0001 RSI: 000a RDI: 0088
[   49.805252] RBP:  R08:  R09: 993cf040
[   49.805898] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.806548] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.807200] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.807935] CS:  0010 DS:  ES:  CR0: 80050033
[   49.808461] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.809113] DR0:  DR1:  DR2: 
[   49.809758] DR3:  DR6: fffe0ff0 DR7: 0400
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[   49.814404] RSP: 002b:7ffbedffada8 EFLAGS: 0297 ORIG_RAX: 
0054
[   49.815098] RAX: ffda RBX: 7ffbedffb640 RCX: 7ffbf505ffd7
[   49.815744] RDX: 04449700 RSI:  RDI: 06c8cd50
[   49.816394] RBP: 7ffbedffaea0 R08:  R09: 00017d0b
[   49.817038] R10:  R11: 0297 R12: 0012
[   49.817687] R13: 072823d8 R14: 7ffbedffb700 R15: 072823d8
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd 
libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[   49.821880] RSP: 0018:ac6fc5417e98 EFLAGS: 00010246
[   49.822363] RAX:  RBX: 93b8da3446c0 RCX: 000a
[   49.823008] RDX: 0001 RSI: 000a RDI: 0088
[   49.823658] RBP:  R08:  R09: 993cf040
[   49.825404] R10: 93b92292e580 R11: d27f188a4b80 R12: 
[   49.827147] R13: ff9c R14: fffe R15: 93b8da3446c0
[   49.828890] FS:  7ffbedffb700() GS:93b92788() 
knlGS:
[   49.830725] CS:  0010 DS:  ES:  CR0: 80050033
[   49.832359] CR2: 0088 CR3: 0005e3f74006 CR4: 003606a0
[   49.834085] DR0:  DR1:  DR2: 
[   49.835792] DR3:  DR6: fffe0ff0 DR7: 0400

Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan 
Suggested-by: Joseph Qi 
---
 fs/overlayfs/dir.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..9831e7046038 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,7 +993,26 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
+  

[tip: sched/urgent] sched/fair: Don't assign runtime for throttled cfs_rq

2019-09-03 Thread tip-bot2 for Liangyan
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 5e2d2cc2588bd3307ce3937acbc2ed03c830a861
Gitweb:
https://git.kernel.org/tip/5e2d2cc2588bd3307ce3937acbc2ed03c830a861
Author: Liangyan 
AuthorDate: Mon, 26 Aug 2019 20:16:33 +08:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 08:55:07 +02:00

sched/fair: Don't assign runtime for throttled cfs_rq

do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime() to unthrottle cfs_rqs; sometimes cfs_b->runtime
will incorrectly allocate all of the quota to one cfs_rq, and then the
other cfs_rqs attached to this cfs_b can't get runtime and stay throttled.

We found that one throttled cfs_rq had non-negative
cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to u64
in the snippet:

  distribute_cfs_runtime() {
runtime = -cfs_rq->runtime_remaining + 1;
  }

The runtime here will change to a large number and consume all
cfs_b->runtime in this cfs_b period.
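
(Aside, for illustration only -- not kernel code: a tiny standalone
program, with an arbitrary example value, showing the s64-to-u64
conversion described above.  With a positive runtime_remaining,
"-cfs_rq->runtime_remaining + 1" is negative and wraps to a huge u64.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* a throttled cfs_rq left with positive remaining runtime (ns) */
	int64_t runtime_remaining = 5000;
	/* same expression and destination type as in distribute_cfs_runtime() */
	uint64_t runtime = -runtime_remaining + 1;

	/* prints 18446744073709546617 -- enough to drain cfs_b->runtime */
	printf("%llu\n", (unsigned long long)runtime);
	return 0;
}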

According to Ben Segall, the throttled cfs_rq can have
account_cfs_rq_runtime called on it because it is throttled before
idle_balance, and the idle_balance calls update_rq_clock to add time
that is accounted to the task.

This commit prevents a cfs_rq from being assigned new runtime while it is
throttled, until distribute_cfs_runtime() is called for it.

Signed-off-by: Liangyan 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Valentin Schneider 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: shanp...@linux.alibaba.com
Cc: sta...@vger.kernel.org
Cc: xlp...@linux.alibaba.com
Fixes: d3d9dc330236 ("sched: Throttle entities exceeding their allowed 
bandwidth")
Link: 
https://lkml.kernel.org/r/20190826121633.6538-1-liangyan.p...@linux.alibaba.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfea..500f5db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4470,6 +4470,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4673,6 +4675,9 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;


Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq

2019-08-26 Thread Liangyan




On 19/8/27 1:38 AM, bseg...@google.com wrote:

Valentin Schneider  writes:


On 23/08/2019 21:00, bseg...@google.com wrote:
[...]

Could you mention in the message that a throttled cfs_rq can have
account_cfs_rq_runtime() called on it because it is throttled before
idle_balance, and idle_balance calls update_rq_clock to add time that
is accounted to the task.



Mayhaps even a comment for the extra condition.


I think this solution is less risky than unthrottling
in this area, so other than that:

Reviewed-by: Ben Segall 



If you don't mind squashing this in:

-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1d9cec9b1ed..b47b0bcf56bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4630,6 +4630,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
+		/* Pick the minimum amount to return to a positive quota state */
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
----->8-----

I'm not adamant about the extra comment, but the WARN_ON would be nice IMO.


@Ben, do you reckon we want to strap

Cc: 
Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against 
bandwidth")

to the thing? AFAICT the pick_next_task_fair() + idle_balance() dance you
described should still be possible on that commit.


I'm not sure about stable policy in general, but it seems reasonable.
The WARN_ON might want to be WARN_ON_ONCE, and it seems fine to have it
or not.


Thanks Ben and Valentin for all of the comments. Per Xunlei's suggestion,
I used SCHED_WARN_ON instead in v3. Regarding whether to cc stable, I'm
also not sure.







Other than that,

Reviewed-by: Valentin Schneider 

[...]


[PATCH v3] sched/fair: don't assign runtime for throttled cfs_rq

2019-08-26 Thread Liangyan
do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime() to unthrottle cfs_rqs; sometimes cfs_b->runtime
will incorrectly allocate all of the quota to one cfs_rq, and then the
other cfs_rqs attached to this cfs_b can't get runtime and stay throttled.

We found that one throttled cfs_rq had non-negative
cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to u64
in the snippet: distribute_cfs_runtime() {
runtime = -cfs_rq->runtime_remaining + 1; }.
The runtime here will change to a large number and consume all
cfs_b->runtime in this cfs_b period.

According to Ben Segall, the throttled cfs_rq can have
account_cfs_rq_runtime called on it because it is throttled before
idle_balance, and the idle_balance calls update_rq_clock to add time
that is accounted to the task.

This commit prevents a cfs_rq from being assigned new runtime while it is
throttled, until distribute_cfs_runtime() is called for it.

Signed-off-by: Liangyan 
Reviewed-by: Ben Segall 
Reviewed-by: Valentin Schneider 
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..500f5db0de0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4470,6 +4470,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4673,6 +4675,9 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
-- 
2.14.4.44.g2045bb6



[PATCH v2] sched/fair: don't assign runtime for throttled cfs_rq

2019-08-24 Thread Liangyan
do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime() to unthrottle cfs_rqs; sometimes cfs_b->runtime
will incorrectly allocate all of the quota to one cfs_rq.
This causes the other cfs_rqs to get no runtime and stay throttled.
We found that one throttled cfs_rq had non-negative
cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to u64
in the snippet: distribute_cfs_runtime() {
runtime = -cfs_rq->runtime_remaining + 1; }.
This cast turns runtime into a huge number, and cfs_b->runtime ends up
being drained to zero.
According to Ben Segall, the throttled cfs_rq can have
account_cfs_rq_runtime called on it because it is throttled before
idle_balance, and the idle_balance calls update_rq_clock to add time
that is accounted to the task.

This commit prevents a cfs_rq from being assigned new runtime while it is
throttled, avoiding the incorrect type cast above.

Signed-off-by: Liangyan 
Reviewed-by: Ben Segall 
Reviewed-by: Valentin Schneider 
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..ac3ae694d850 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4470,6 +4470,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4673,6 +4675,9 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
-- 
2.14.4.44.g2045bb6



Re: [PATCH] sched/fair: Add missing unthrottle_cfs_rq()

2019-08-23 Thread Liangyan

Resend.

Sorry that my previous email had a formatting issue.

On 19/8/23 2:48 AM, bseg...@google.com wrote:

Valentin Schneider  writes:


Turns out a cfs_rq->runtime_remaining can become positive in
assign_cfs_rq_runtime(), but this codepath has no call to
unthrottle_cfs_rq().

This can leave us in a situation where we have a throttled cfs_rq with
positive ->runtime_remaining, which breaks the math in
distribute_cfs_runtime(): this function expects a negative value so that
it may safely negate it into a positive value.

Add the missing unthrottle_cfs_rq(). While at it, add a WARN_ON where
we expect negative values, and pull in a comment from the mailing list
that didn't make it in [1].

[1]: 
https://lkml.kernel.org/r/BANLkTi=nmcxkx6ebdqcjydj5kkyg2n1...@mail.gmail.com

Cc: 
Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against 
bandwidth")
Reported-by: Liangyan 
Signed-off-by: Valentin Schneider 


Having now seen the rest of the thread:

Could you send the repro, as it doesn't seem to have reached lkml, so
that I can confirm my guess as to what's going on?

It seems most likely we throttle during one of the remove-change-adds in
set_cpus_allowed and friends or during the put half of pick_next_task
followed by idle balance to drop the lock. Then distribute races with a
later assign_cfs_rq_runtime so that the account finds runtime in the
cfs_b.

pick_next_task_fair() calls update_curr() but gets zero runtime because
cfs_b->runtime == 0, then throttles the current cfs_rq because
cfs_rq->runtime_remaining == 0, and then put_prev_entity() calls
__account_cfs_rq_runtime(), which assigns new runtime because
dequeue_entity() on another CPU has just called return_cfs_rq_runtime()
to return some runtime to cfs_b->runtime.



Please check the debug log below to trace this logic.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..0da556c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4395,6 +4395,12 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled && smp_processor_id()==20) {
+		pr_err("__account_cfs_rq_runtime:cfs_rq=%p,comm=%s,pid=%d,cpu=%d\n",cfs_rq,current->comm,current->pid,smp_processor_id());
+		dump_stack();
+	}
+	//if (cfs_rq->throttled)
+	//	return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4508,6 +4514,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	sub_nr_running(rq, task_delta);
 
 	cfs_rq->throttled = 1;
+	{
+		if (cfs_rq->nr_running > 0 && smp_processor_id()==20) {
+			pr_err("throttle_cfs_rq:cfs_rq=%p,comm=%s,pid=%d,cpu=%d\n",cfs_rq,current->comm,current->pid,smp_processor_id());
+			dump_stack();
+		}
+	}
 	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
 	empty = list_empty(&cfs_b->throttled_cfs_rq);


[  257.406397] 
throttle_cfs_rq:cfs_rq=b012d731,comm=delay,pid=4307,cpu=20
[  257.407251] CPU: 20 PID: 4307 Comm: delay Tainted: GW   E 
5.2.0-rc3+ #9

[  257.408795] Call Trace:
[  257.409085]  dump_stack+0x5c/0x7b
[  257.409482]  throttle_cfs_rq+0x2c3/0x2d0
[  257.409940]  check_cfs_rq_runtime+0x2f/0x50
[  257.410430]  pick_next_task_fair+0xb1/0x740
[  257.410918]  __schedule+0x104/0x670
[  257.411333]  schedule+0x33/0x90
[  257.411706]  exit_to_usermode_loop+0x6a/0xf0
[  257.412201]  prepare_exit_to_usermode+0x80/0xc0
[  257.412734]  retint_user+0x8/0x8
[  257.413114] RIP: 0033:0x4006d0
[  257.413480] Code: 7d f8 48 8b 45 f8 48 85 c0 74 24 eb 0d 0f 1f 00 66 
2e 0f 1f 84 00 00 00 00 00 eb 0e 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 
00 <48> ff c8 75 fb 48 ff c8 90 5d c3 55 48 89 e5 48 83 ec 18 48 89 7d
[  257.415630] RSP: 002b:7f9b74abbe90 EFLAGS: 0206 ORIG_RAX: 
ff13
[  257.416508] RAX: 00060f7b RBX:  RCX: 

[  257.417333] RDX: 002625b3 RSI:  RDI: 
002625b4
[  257.418155] RBP: 7f9b74abbe90 R08: 7f9b74abc700 R09: 
7f9b74abc700
[  257.418983] R10: 7f9b74abc9d0 R11:  R12: 
7ffee72e1afe
[  257.419813] R13: 7ffee72e1aff R14: 7ffee72e1b00 R15: 




[  257.420718] 
__account_cfs_rq_runtime:cfs_rq=b012d731,comm=delay,pid=4307,cpu=20
[  257.421646] CPU: 20 PID: 4307 Comm: delay Tainted: GW   E 
5.2.0-rc3+ #9

[  257.422538] Call Trace:
[  257.424712]  dump_stack+0x5c/0x7b
[  257.425101]  __account_cfs_rq_runtime+0x1d7/0x1e0
[  257.425656]  put_prev_entity+0x90/0x100
[  257.426102]  pick_next_task_fair+0x334/0x740
[  257.426605]  __schedule+0x104/0x670
[  257.427013]  schedule+0x3

[PATCH] sched/fair: don't assign runtime for throttled cfs_rq

2019-08-14 Thread Liangyan
do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime() to unthrottle cfs_rqs; sometimes cfs_b->runtime
will incorrectly allocate all of the quota to one cfs_rq.
This causes the other cfs_rqs to get no runtime and stay throttled.
We found that one throttled cfs_rq had non-negative
cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to u64
in the snippet: distribute_cfs_runtime() {
runtime = -cfs_rq->runtime_remaining + 1; }.
This cast turns runtime into a huge number, and cfs_b->runtime ends up
being drained to zero.

This commit prevents a cfs_rq from being assigned new runtime while it is
throttled, avoiding the incorrect type cast above.

Signed-off-by: Liangyan 
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fd8a2a605b..b14d67d28138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4068,6 +4068,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
 * hierarchy can be throttled
-- 
2.14.4.44.g2045bb6



[PATCH] sched/fair: don't restart enqueued cfs quota slack timer

2019-06-02 Thread Liangyan
From: "liangyan.ply" 

start_cfs_slack_bandwidth() restarts the quota slack timer. If it is
called frequently, the timer is re-armed over and over and may never get
a chance to expire and unthrottle cfs tasks. As a result, throttled tasks
cannot be unthrottled in time even though they have remaining quota.
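
(Aside, for illustration only -- not kernel code: a small userspace
sketch using POSIX timers, with made-up intervals, of the re-arm problem
described above.  A relative timer that is restarted more often than its
period never gets to fire; guarding the restart with an "already armed?"
check -- the analogue of the hrtimer_active() test in the patch below --
lets it expire.)

#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t fired;

static void on_timer(int sig) { fired = 1; }

static void arm_slack_timer(timer_t t, bool guard)
{
	struct itimerspec when = { .it_value = { .tv_nsec = 5 * 1000 * 1000 } };
	struct itimerspec cur;

	if (guard) {
		timer_gettime(t, &cur);
		if (cur.it_value.tv_sec || cur.it_value.tv_nsec)
			return;		/* already pending: don't push the expiry out again */
	}
	timer_settime(t, 0, &when, NULL);	/* (re)arm 5ms into the future */
}

int main(void)
{
	timer_t t;

	signal(SIGALRM, on_timer);
	timer_create(CLOCK_MONOTONIC, NULL, &t);	/* default: delivers SIGALRM on expiry */

	/* A caller that "frequently" re-arms, like start_cfs_slack_bandwidth(). */
	for (int i = 0; i < 2000 && !fired; i++) {
		arm_slack_timer(t, /* guard = */ false);	/* flip to true and it fires */
		usleep(1000);				/* 1ms between calls, shorter than the 5ms timer */
	}
	printf("timer fired: %s\n", fired ? "yes" : "no");
	return 0;
}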

Signed-off-by: Liangyan 
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d90a64620072..fdb03c752f97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4411,9 +4411,11 @@ static void start_cfs_slack_bandwidth(struct 
cfs_bandwidth *cfs_b)
if (runtime_refresh_within(cfs_b, min_left))
return;
 
-   hrtimer_start(&cfs_b->slack_timer,
+   if (!hrtimer_active(&cfs_b->slack_timer)) {
+   hrtimer_start(&cfs_b->slack_timer,
ns_to_ktime(cfs_bandwidth_slack_period),
HRTIMER_MODE_REL);
+   }
 }
 
 /* we know any runtime found here is valid as update_curr() precedes return */
-- 
2.14.4.44.g2045bb6