Re: [PATCH v4] ovl: fix dentry leak in ovl_get_redirect
ping

On 20/12/22 11:06 AM, Liangyan wrote:

We need to take d_parent->d_lock before calling dget_dlock(), or d_lockref may be updated in parallel as in the calltrace below, which will cause a dentry->d_lockref leak and risk a crash.

CPU 0                                CPU 1
ovl_set_redirect                     lookup_fast
  ovl_get_redirect                     __d_lookup
    dget_dlock                           spin_lock(&dentry->d_lock)
    // no lock protection here           dentry->d_lockref.count++
    dentry->d_lockref.count++

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[...]
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]

Cc:
Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan
Reviewed-by: Joseph Qi
Suggested-by: Al Viro
---
 fs/overlayfs/dir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..d1efa3a5a503 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -992,8 +992,8 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
 
-		tmp = dget_dlock(d->d_parent);
 		spin_unlock(&d->d_lock);
+		tmp = dget_parent(d);
 		dput(d);
 		d = tmp;
Re: [PATCH v4] ovl: fix dentry leak in ovl_get_redirect
Thanks, Viro.

@Miklos, can you please advise?

On 20/12/22 11:26 AM, Al Viro wrote:

On Tue, Dec 22, 2020 at 11:06:26AM +0800, Liangyan wrote:

Cc:
Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan
Reviewed-by: Joseph Qi
Suggested-by: Al Viro

Fine by me... I can put it through vfs.git#fixes, but IMO that would be better off in the overlayfs tree.
[PATCH v4] ovl: fix dentry leak in ovl_get_redirect
We need to take d_parent->d_lock before calling dget_dlock(), or d_lockref may be updated in parallel as in the calltrace below, which will cause a dentry->d_lockref leak and risk a crash.

CPU 0                                CPU 1
ovl_set_redirect                     lookup_fast
  ovl_get_redirect                     __d_lookup
    dget_dlock                           spin_lock(&dentry->d_lock)
    // no lock protection here           dentry->d_lockref.count++
    dentry->d_lockref.count++

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[...]
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]

Cc:
Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan
Reviewed-by: Joseph Qi
Suggested-by: Al Viro
---
 fs/overlayfs/dir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..d1efa3a5a503 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -992,8 +992,8 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
 
-		tmp = dget_dlock(d->d_parent);
 		spin_unlock(&d->d_lock);
+		tmp = dget_parent(d);
 		dput(d);
 		d = tmp;
--
2.14.4.44.g2045bb6
[PATCH v3] ovl: fix dentry leak in ovl_get_redirect
We need to take d_parent->d_lock before calling dget_dlock(), or d_lockref may be updated in parallel as in the calltrace below, which will cause a dentry->d_lockref leak and risk a crash.

CPU 0                                CPU 1
ovl_set_redirect                     lookup_fast
  ovl_get_redirect                     __d_lookup
    dget_dlock                           spin_lock(&dentry->d_lock)
    // no lock protection here           dentry->d_lockref.count++
    dentry->d_lockref.count++

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[...]
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]

Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan
Reviewed-by: Joseph Qi
Suggested-by: Al Viro
---
 fs/overlayfs/dir.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..e9aa4a12ad82 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,11 +993,10 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
 
-		tmp = dget_dlock(d->d_parent);
 		spin_unlock(&d->d_lock);
-
+		parent = dget_parent(d);
 		dput(d);
-		d = tmp;
+		d = parent;
 
 		/* Absolute redirect: finished */
 		if (buf[buflen] == '/')
--
2.14.4.44.g2045bb6
Re: [PATCH v2] ovl: fix dentry leak in ovl_get_redirect
Exactly, I missed this definition of d_lock and treated it as a separate member of struct dentry:

#define d_lock d_lockref.lock

Thanks for the explanation. I will post a new patch as you suggest.

Regards,
Liangyan

On 20/12/22 1:35 AM, Al Viro wrote:

On Tue, Dec 22, 2020 at 12:51:27AM +0800, Liangyan wrote:

This is the race scenario, based on the call trace we captured, which causes the dentry leak:

CPU 0                                CPU 1
ovl_set_redirect                     lookup_fast
  ovl_get_redirect                     __d_lookup
    dget_dlock                           spin_lock(&dentry->d_lock)
    // no lock protection here           dentry->d_lockref.count++
    dentry->d_lockref.count++

If we use dget_parent instead, we may have this race:

CPU 0                                        CPU 1
ovl_set_redirect                             lookup_fast
  ovl_get_redirect                             __d_lookup
    dget_parent
      raw_seqcount_begin(&dentry->d_seq)         spin_lock(&dentry->d_lock)
      lockref_get_not_zero(&ret->d_lockref)      dentry->d_lockref.count++

And? lockref_get_not_zero() will observe ->d_lock held and fall back to taking it. The whole point of lockref is that the counter and the spinlock are next to each other. The fastpath in lockref_get_not_zero() is a cmpxchg on both, and it is taken only if ->d_lock is *NOT* locked. The slow path there will do spin_lock() around the manipulations of ->count. Note that ->d_lock is simply ->d_lockref.lock; ->d_seq has nothing to do with the whole thing.

The race in mainline is real; if you can observe anything of that sort with dget_parent(), we have a much worse problem. Consider dget() vs. lookup_fast() -- no overlayfs weirdness in sight and the same kind of concurrent access. Again, lockref primitives can be safely mixed with other threads doing operations on ->count while holding ->lock.
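The lockref behaviour Viro describes above -- a lockless fastpath that is attempted only when the lock is observed unlocked, with a fallback that bumps the count under the lock -- can be sketched with a toy model. This is an illustrative simplification, not the kernel's struct lockref API (which lives in include/linux/lockref.h and does the fastpath as a single cmpxchg over the packed {lock, count} word):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lockref: the lock word and the refcount sit next to each other,
 * which is what lets the real kernel code cmpxchg both at once. */
struct toy_lockref {
	uint32_t locked; /* 0 = unlocked, 1 = held (stand-in for the spinlock) */
	int32_t  count;
};

/* Mirrors the shape of lockref_get_not_zero(): the lockless fastpath is
 * only taken when the lock is observed unlocked, so it can never race
 * with a thread doing spin_lock(&lock); count++ by hand.  Otherwise we
 * fall back to incrementing under the lock. */
static bool toy_get_not_zero(struct toy_lockref *ref)
{
	if (ref->count <= 0)
		return false;
	if (ref->locked == 0) {
		/* fastpath: in the kernel this is one cmpxchg on {lock, count} */
		ref->count++;
		return true;
	}
	/* slowpath: spin_lock(&lock); count++; spin_unlock(&lock); */
	ref->count++;
	return true;
}
```

This is why mixing lockref primitives with open-coded `spin_lock(&d_lock); count++` is safe, while `dget_dlock()` on a dentry whose lock you do not hold is not: the plain `count++` in dget_dlock has no such guard at all.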
Re: [PATCH v2] ovl: fix dentry leak in ovl_get_redirect
This is the race scenario, based on the call trace we captured, which causes the dentry leak:

CPU 0                                CPU 1
ovl_set_redirect                     lookup_fast
  ovl_get_redirect                     __d_lookup
    dget_dlock                           spin_lock(&dentry->d_lock)
    // no lock protection here           dentry->d_lockref.count++
    dentry->d_lockref.count++

If we use dget_parent instead, we may have this race:

CPU 0                                        CPU 1
ovl_set_redirect                             lookup_fast
  ovl_get_redirect                             __d_lookup
    dget_parent
      raw_seqcount_begin(&dentry->d_seq)         spin_lock(&dentry->d_lock)
      lockref_get_not_zero(&ret->d_lockref)      dentry->d_lockref.count++

On 20/12/21 8:11 PM, Al Viro wrote:

On Mon, Dec 21, 2020 at 07:14:44PM +0800, Joseph Qi wrote:

Hi Viro,

On 12/21/20 2:26 PM, Al Viro wrote:

On Sun, Dec 20, 2020 at 08:09:27PM +0800, Liangyan wrote:

+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,7 +993,22 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
 
-		tmp = dget_dlock(d->d_parent);
+		parent = d->d_parent;
+		if (unlikely(!spin_trylock(&parent->d_lock))) {
+			rcu_read_lock();
+			spin_unlock(&d->d_lock);
+again:
+			parent = READ_ONCE(d->d_parent);
+			spin_lock(&parent->d_lock);
+			if (unlikely(parent != d->d_parent)) {
+				spin_unlock(&parent->d_lock);
+				goto again;
+			}
+			rcu_read_unlock();
+			spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
+		}
+		tmp = dget_dlock(parent);
+		spin_unlock(&parent->d_lock);
 		spin_unlock(&d->d_lock);

Yecch... What's wrong with just doing

		spin_unlock(&d->d_lock);
		parent = dget_parent(d);
		dput(d);
		d = parent;

instead of that?

Now the race happens on the non-RCU path in lookup_fast(); I'm afraid d_seq can not close the race window.

Explain, please. What exactly are you observing?
Re: [PATCH v2] ovl: fix dentry leak in ovl_get_redirect
Guys, any comments on this patch? This issue should exist in the latest upstream.

On 20/12/20 8:09 PM, Liangyan wrote:

We need to take d_parent->d_lock before calling dget_dlock(), or d_lockref may be updated in parallel as in the calltrace below, which will cause a dentry->d_lockref leak and risk a crash.

npm-20576 [028] 5705749.040094:
 [28] ovl_set_redirect+0x11c/0x310    // tmp = dget_dlock(d->d_parent);
 [28] ? ovl_set_redirect+0x5/0x310
 [28] ovl_rename+0x4db/0x790 [overlay]
 [28] vfs_rename+0x6e8/0x920
 [28] do_renameat2+0x4d6/0x560
 [28] __x64_sys_rename+0x1c/0x20
 [28] do_syscall_64+0x55/0x1a0
 [28] entry_SYSCALL_64_after_hwframe+0x44/0xa9

npm-20574 [036] 5705749.040094:
 [36] __d_lookup+0x107/0x140    // dentry->d_lockref.count++;
 [36] lookup_fast+0xe0/0x2d0
 [36] walk_component+0x48/0x350
 [36] link_path_walk+0x1bf/0x650
 [36] ? path_init+0x1f6/0x2f0
 [36] path_lookupat+0x82/0x210
 [36] filename_lookup+0xb8/0x1a0
 [36] ? __audit_getname+0xa2/0xb0
 [36] ? getname_flags+0xb9/0x1e0
 [36] ? vfs_statx+0x73/0xe0
 [36] vfs_statx+0x73/0xe0
 [36] __do_sys_statx+0x3b/0x80
 [36] ? syscall_trace_enter+0x1ae/0x2c0
 [36] do_syscall_64+0x55/0x1a0
 [36] entry_SYSCALL_64_

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[...]
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]

Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan
Suggested-by: Joseph Qi
---
 fs/overlayfs/dir.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..a78d35017371 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,7 +993,22 @@ static char *ovl_get_redirect(struct dentr
[PATCH v2] ovl: fix dentry leak in ovl_get_redirect
We need to take d_parent->d_lock before calling dget_dlock(), or d_lockref may be updated in parallel as in the calltrace below, which will cause a dentry->d_lockref leak and risk a crash.

npm-20576 [028] 5705749.040094:
 [28] ovl_set_redirect+0x11c/0x310    // tmp = dget_dlock(d->d_parent);
 [28] ? ovl_set_redirect+0x5/0x310
 [28] ovl_rename+0x4db/0x790 [overlay]
 [28] vfs_rename+0x6e8/0x920
 [28] do_renameat2+0x4d6/0x560
 [28] __x64_sys_rename+0x1c/0x20
 [28] do_syscall_64+0x55/0x1a0
 [28] entry_SYSCALL_64_after_hwframe+0x44/0xa9

npm-20574 [036] 5705749.040094:
 [36] __d_lookup+0x107/0x140    // dentry->d_lockref.count++;
 [36] lookup_fast+0xe0/0x2d0
 [36] walk_component+0x48/0x350
 [36] link_path_walk+0x1bf/0x650
 [36] ? path_init+0x1f6/0x2f0
 [36] path_lookupat+0x82/0x210
 [36] filename_lookup+0xb8/0x1a0
 [36] ? __audit_getname+0xa2/0xb0
 [36] ? getname_flags+0xb9/0x1e0
 [36] ? vfs_statx+0x73/0xe0
 [36] vfs_statx+0x73/0xe0
 [36] __do_sys_statx+0x3b/0x80
 [36] ? syscall_trace_enter+0x1ae/0x2c0
 [36] do_syscall_64+0x55/0x1a0
 [36] entry_SYSCALL_64_

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[...]
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]

Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan
Suggested-by: Joseph Qi
---
 fs/overlayfs/dir.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..a78d35017371 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,7 +993,22 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
[PATCH] ovl: fix dentry leak in ovl_get_redirect
We need to take d_parent->d_lock before calling dget_dlock(), or d_lockref may be updated in parallel as in the calltrace below, which will cause a dentry->d_lockref leak and risk a crash.

npm-20576 [028] 5705749.040094:
 [28] ovl_set_redirect+0x11c/0x310    // tmp = dget_dlock(d->d_parent);
 [28] ? ovl_set_redirect+0x5/0x310
 [28] ovl_rename+0x4db/0x790 [overlay]
 [28] vfs_rename+0x6e8/0x920
 [28] do_renameat2+0x4d6/0x560
 [28] __x64_sys_rename+0x1c/0x20
 [28] do_syscall_64+0x55/0x1a0
 [28] entry_SYSCALL_64_after_hwframe+0x44/0xa9

npm-20574 [036] 5705749.040094:
 [36] __d_lookup+0x107/0x140    // dentry->d_lockref.count++;
 [36] lookup_fast+0xe0/0x2d0
 [36] walk_component+0x48/0x350
 [36] link_path_walk+0x1bf/0x650
 [36] ? path_init+0x1f6/0x2f0
 [36] path_lookupat+0x82/0x210
 [36] filename_lookup+0xb8/0x1a0
 [36] ? __audit_getname+0xa2/0xb0
 [36] ? getname_flags+0xb9/0x1e0
 [36] ? vfs_statx+0x73/0xe0
 [36] vfs_statx+0x73/0xe0
 [36] __do_sys_statx+0x3b/0x80
 [36] ? syscall_trace_enter+0x1ae/0x2c0
 [36] do_syscall_64+0x55/0x1a0
 [36] entry_SYSCALL_64_

[   49.799059] PGD 80061fed7067 P4D 80061fed7067 PUD 61fec5067 PMD 0
[   49.799689] Oops: 0002 [#1] SMP PTI
[   49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1
[   49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[   49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]
[   49.810410] Call Trace:
[   49.810653]  d_delete+0x2c/0xb0
[   49.810951]  vfs_rmdir+0xfd/0x120
[   49.811264]  do_rmdir+0x14f/0x1a0
[   49.811573]  do_syscall_64+0x5b/0x190
[   49.811917]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   49.812385] RIP: 0033:0x7ffbf505ffd7
[...]
[   49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042
[   49.819052] CR2: 0088
[   49.819368] ---[ end trace 4e652b8aa299aa2d ]---
[   49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20
[...]

Fixes: a6c606551141 ("ovl: redirect on rename-dir")
Signed-off-by: Liangyan
Suggested-by: Joseph Qi
---
 fs/overlayfs/dir.c | 20
 1 file changed, 20 insertions(+)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 28a075b5f5b2..9831e7046038 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -973,6 +973,7 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 	for (d = dget(dentry); !IS_ROOT(d);) {
 		const char *name;
 		int thislen;
+		struct dentry *parent = NULL;
 
 		spin_lock(&d->d_lock);
 		name = ovl_dentry_get_redirect(d);
@@ -992,7 +993,26 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
 		buflen -= thislen;
 		memcpy(&buf[buflen], name, thislen);
 
+
[tip: sched/urgent] sched/fair: Don't assign runtime for throttled cfs_rq
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     5e2d2cc2588bd3307ce3937acbc2ed03c830a861
Gitweb:        https://git.kernel.org/tip/5e2d2cc2588bd3307ce3937acbc2ed03c830a861
Author:        Liangyan
AuthorDate:    Mon, 26 Aug 2019 20:16:33 +08:00
Committer:     Ingo Molnar
CommitterDate: Tue, 03 Sep 2019 08:55:07 +02:00

sched/fair: Don't assign runtime for throttled cfs_rq

do_sched_cfs_period_timer() will refill cfs_b runtime and call distribute_cfs_runtime() to unthrottle cfs_rqs; sometimes cfs_b->runtime will incorrectly allocate all quota to one cfs_rq, and then the other cfs_rqs attached to this cfs_b can't get runtime and will be throttled.

We found that a throttled cfs_rq can have non-negative cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to u64 in this snippet:

	distribute_cfs_runtime() {
		runtime = -cfs_rq->runtime_remaining + 1;
	}

The runtime here then becomes a huge number and consumes all of cfs_b->runtime for this cfs_b period.

According to Ben Segall, the throttled cfs_rq can have account_cfs_rq_runtime() called on it because it is throttled before idle_balance, and idle_balance calls update_rq_clock to add time that is accounted to the task.

This commit prevents a cfs_rq from being assigned new runtime while it is throttled, until distribute_cfs_runtime() is called.

Signed-off-by: Liangyan
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Valentin Schneider
Reviewed-by: Ben Segall
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: shanp...@linux.alibaba.com
Cc: sta...@vger.kernel.org
Cc: xlp...@linux.alibaba.com
Fixes: d3d9dc330236 ("sched: Throttle entities exceeding their allowed bandwidth")
Link: https://lkml.kernel.org/r/20190826121633.6538-1-liangyan.p...@linux.alibaba.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfea..500f5db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4470,6 +4470,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4673,6 +4675,9 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
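The sign flip described in the commit message is easy to demonstrate in isolation. The following is a standalone sketch, not kernel code: `needed_runtime` stands in for the `runtime = -cfs_rq->runtime_remaining + 1` calculation in distribute_cfs_runtime(). The math assumes a throttled cfs_rq always has runtime_remaining <= 0; if the value is positive instead, negating it yields a negative s64 which, stored into a u64, wraps to an enormous number that drains the whole bandwidth pool:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the runtime top-up in distribute_cfs_runtime(): the amount
 * needed to bring a throttled cfs_rq's remaining runtime back to +1.
 * Correct only while runtime_remaining <= 0. */
static uint64_t needed_runtime(int64_t runtime_remaining)
{
	return (uint64_t)(-runtime_remaining + 1);
}
```

With runtime_remaining = -5 this returns the intended 6; with a (supposedly impossible) positive 3 it returns 2^64 - 2, which is exactly the failure mode the SCHED_WARN_ON in the merged patch guards against.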
Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
On 19/8/27 1:38 AM, bseg...@google.com wrote:

Valentin Schneider writes:

On 23/08/2019 21:00, bseg...@google.com wrote:

[...]

Could you mention in the message that a throttled cfs_rq can have account_cfs_rq_runtime called on it because it is throttled before idle_balance, and the idle_balance calls update_rq_clock to add time that is accounted to the task. Mayhaps even a comment for the extra condition.

I think this solution is less risky than unthrottling in this area, so other than that:

Reviewed-by: Ben Segall

If you don't mind squashing this in:

-8<-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1d9cec9b1ed..b47b0bcf56bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4630,6 +4630,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
+		/* Pick the minimum amount to return to a positive quota state */
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
->8-

I'm not adamant about the extra comment, but the WARN_ON would be nice IMO.

@Ben, do you reckon we want to strap

Cc:
Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")

to the thing? AFAICT the pick_next_task_fair() + idle_balance() dance you described should still be possible on that commit.

I'm not sure about stable policy in general, but it seems reasonable. The WARN_ON might want to be WARN_ON_ONCE, and it seems fine to have it or not.

Thanks Ben and Valentin for all of the comments. Per Xunlei's suggestion, I used SCHED_WARN_ON instead in v3. Regarding whether to cc stable, I'm also not sure.

Other than that,

Reviewed-by: Valentin Schneider

[...]
[PATCH v3] sched/fair: don't assign runtime for throttled cfs_rq
do_sched_cfs_period_timer() will refill cfs_b runtime and call distribute_cfs_runtime() to unthrottle cfs_rqs; sometimes cfs_b->runtime will incorrectly allocate all quota to one cfs_rq, and then the other cfs_rqs attached to this cfs_b can't get runtime and will be throttled.

We found that a throttled cfs_rq can have non-negative cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to u64 in this snippet:

	distribute_cfs_runtime() {
		runtime = -cfs_rq->runtime_remaining + 1;
	}

The runtime here then becomes a huge number and consumes all of cfs_b->runtime for this cfs_b period.

According to Ben Segall, the throttled cfs_rq can have account_cfs_rq_runtime() called on it because it is throttled before idle_balance, and idle_balance calls update_rq_clock to add time that is accounted to the task.

This commit prevents a cfs_rq from being assigned new runtime while it is throttled, until distribute_cfs_runtime() is called.

Signed-off-by: Liangyan
Reviewed-by: Ben Segall
Reviewed-by: Valentin Schneider
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..500f5db0de0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4470,6 +4470,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4673,6 +4675,9 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
--
2.14.4.44.g2045bb6
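The effect of the patch's early return can be modelled with a toy account function. This is a hypothetical simplification, not the real __account_cfs_rq_runtime()/assign_cfs_rq_runtime() pair: quota acquisition is modelled as a fixed refill, and `toy_account` only shows the control flow the patch changes. Once a cfs_rq is marked throttled, charging more time may drive runtime_remaining further negative but never refills it, preserving the invariant (runtime_remaining <= 0 while throttled) that distribute_cfs_runtime() relies on:

```c
#include <assert.h>
#include <stdint.h>

struct toy_cfs_rq {
	int     throttled;
	int64_t runtime_remaining;
};

/* Hypothetical stand-in for __account_cfs_rq_runtime(): charge delta,
 * then try to acquire fresh quota only when the cfs_rq is not throttled.
 * The middle check is the patch's "if (cfs_rq->throttled) return;". */
static void toy_account(struct toy_cfs_rq *rq, int64_t delta, int64_t refill)
{
	rq->runtime_remaining -= delta;
	if (rq->runtime_remaining > 0)
		return;
	if (rq->throttled)
		return; /* never assign runtime to a throttled cfs_rq */
	rq->runtime_remaining += refill; /* stand-in for assign_cfs_rq_runtime() */
}
```

Without the throttled check, a throttled cfs_rq hit by this path could come out with positive runtime_remaining, which is precisely the state the SCHED_WARN_ON in distribute_cfs_runtime() flags.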
[PATCH v2] sched/fair: don't assign runtime for throttled cfs_rq
do_sched_cfs_period_timer() will refill cfs_b runtime and call distribute_cfs_runtime() to unthrottle cfs_rqs. Sometimes cfs_b->runtime will incorrectly allocate all of the quota to one cfs_rq, so the other cfs_rqs can't get runtime and stay throttled.

We find that one throttled cfs_rq has a non-negative cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to u64 in this snippet of distribute_cfs_runtime():

	runtime = -cfs_rq->runtime_remaining + 1;

This cast makes runtime a large number, and cfs_b->runtime ends up subtracted down to zero.

According to Ben Segall, the throttled cfs_rq can have account_cfs_rq_runtime() called on it because it is throttled before idle_balance(), and idle_balance() calls update_rq_clock() to add time that is accounted to the task.

This commit prevents a cfs_rq from being assigned new runtime while it is throttled, to avoid the above incorrect type cast.

Signed-off-by: Liangyan
Reviewed-by: Ben Segall
Reviewed-by: Valentin Schneider
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..ac3ae694d850 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4470,6 +4470,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;

+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4673,6 +4675,9 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;

+		/* By the above check, this should never be true */
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
--
2.14.4.44.g2045bb6
Re: [PATCH] sched/fair: Add missing unthrottle_cfs_rq()
Resend. Sorry that my previous email had a formatting issue.

On 19/8/23 2:48 AM, bseg...@google.com wrote:

Valentin Schneider writes:

Turns out a cfs_rq->runtime_remaining can become positive in assign_cfs_rq_runtime(), but this codepath has no call to unthrottle_cfs_rq(). This can leave us in a situation where we have a throttled cfs_rq with positive ->runtime_remaining, which breaks the math in distribute_cfs_runtime(): this function expects a negative value so that it may safely negate it into a positive value.

Add the missing unthrottle_cfs_rq(). While at it, add a WARN_ON where we expect negative values, and pull in a comment from the mailing list that didn't make it in [1].

[1]: https://lkml.kernel.org/r/BANLkTi=nmcxkx6ebdqcjydj5kkyg2n1...@mail.gmail.com

Cc:
Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
Reported-by: Liangyan
Signed-off-by: Valentin Schneider

Having now seen the rest of the thread: Could you send the repro, as it doesn't seem to have reached lkml, so that I can confirm my guess as to what's going on? It seems most likely we throttle during one of the remove-change-adds in set_cpus_allowed and friends, or during the put half of pick_next_task followed by idle balance to drop the lock. Then distribute races with a later assign_cfs_rq_runtime so that the account finds runtime in the cfs_b.

pick_next_task_fair() calls update_curr() but gets zero runtime because cfs_b->runtime is 0, then throttles the current cfs_rq because cfs_rq->runtime_remaining is 0, then calls put_prev_entity(), which reaches __account_cfs_rq_runtime() and assigns new runtime, because dequeue_entity() on another cpu just called return_cfs_rq_runtime() to return some runtime to cfs_b->runtime. Please check the debug log below to trace this logic.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..0da556c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4395,6 +4395,12 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;

+	if (cfs_rq->throttled && smp_processor_id()==20) {
+		pr_err("__account_cfs_rq_runtime:cfs_rq=%p,comm=%s,pid=%d,cpu=%d\n",cfs_rq,current->comm,current->pid,smp_processor_id());
+		dump_stack();
+	}
+	//if (cfs_rq->throttled)
+	//	return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4508,6 +4514,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	sub_nr_running(rq, task_delta);

 	cfs_rq->throttled = 1;
+	{
+		if (cfs_rq->nr_running > 0 && smp_processor_id()==20) {
+			pr_err("throttle_cfs_rq:cfs_rq=%p,comm=%s,pid=%d,cpu=%d\n",cfs_rq,current->comm,current->pid,smp_processor_id());
+			dump_stack();
+		}
+	}
 	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
 	empty = list_empty(&cfs_b->throttled_cfs_rq);

[  257.406397] throttle_cfs_rq:cfs_rq=b012d731,comm=delay,pid=4307,cpu=20
[  257.407251] CPU: 20 PID: 4307 Comm: delay Tainted: G W E 5.2.0-rc3+ #9
[  257.408795] Call Trace:
[  257.409085]  dump_stack+0x5c/0x7b
[  257.409482]  throttle_cfs_rq+0x2c3/0x2d0
[  257.409940]  check_cfs_rq_runtime+0x2f/0x50
[  257.410430]  pick_next_task_fair+0xb1/0x740
[  257.410918]  __schedule+0x104/0x670
[  257.411333]  schedule+0x33/0x90
[  257.411706]  exit_to_usermode_loop+0x6a/0xf0
[  257.412201]  prepare_exit_to_usermode+0x80/0xc0
[  257.412734]  retint_user+0x8/0x8
[  257.413114] RIP: 0033:0x4006d0
[  257.413480] Code: 7d f8 48 8b 45 f8 48 85 c0 74 24 eb 0d 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 eb 0e 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 <48> ff c8 75 fb 48 ff c8 90 5d c3 55 48 89 e5 48 83 ec 18 48 89 7d
[  257.415630] RSP: 002b:7f9b74abbe90 EFLAGS: 0206 ORIG_RAX: ff13
[  257.416508] RAX: 00060f7b RBX: RCX:
[  257.417333] RDX: 002625b3 RSI: RDI: 002625b4
[  257.418155] RBP: 7f9b74abbe90 R08: 7f9b74abc700 R09: 7f9b74abc700
[  257.418983] R10: 7f9b74abc9d0 R11: R12: 7ffee72e1afe
[  257.419813] R13: 7ffee72e1aff R14: 7ffee72e1b00 R15:
[  257.420718] __account_cfs_rq_runtime:cfs_rq=b012d731,comm=delay,pid=4307,cpu=20
[  257.421646] CPU: 20 PID: 4307 Comm: delay Tainted: G W E 5.2.0-rc3+ #9
[  257.422538] Call Trace:
[  257.424712]  dump_stack+0x5c/0x7b
[  257.425101]  __account_cfs_rq_runtime+0x1d7/0x1e0
[  257.425656]  put_prev_entity+0x90/0x100
[  257.426102]  pick_next_task_fair+0x334/0x740
[  257.426605]  __schedule+0x104/0x670
[  257.427013]  schedule+0x3
[PATCH] sched/fair: don't assign runtime for throttled cfs_rq
do_sched_cfs_period_timer() will refill cfs_b runtime and call distribute_cfs_runtime() to unthrottle cfs_rqs. Sometimes cfs_b->runtime will incorrectly allocate all of the quota to one cfs_rq, so the other cfs_rqs can't get runtime and stay throttled.

We find that one throttled cfs_rq has a non-negative cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to u64 in this snippet of distribute_cfs_runtime():

	runtime = -cfs_rq->runtime_remaining + 1;

This cast makes runtime a large number, and cfs_b->runtime ends up subtracted down to zero.

This commit prevents a cfs_rq from being assigned new runtime while it is throttled, to avoid the above incorrect type cast.

Signed-off-by: Liangyan
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fd8a2a605b..b14d67d28138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4068,6 +4068,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;

+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
--
2.14.4.44.g2045bb6
[PATCH] sched/fair: don't restart enqueued cfs quota slack timer
From: "liangyan.ply"

start_cfs_slack_bandwidth() will restart the quota slack timer. If it is called frequently, the timer is restarted continuously and may never get a chance to expire and unthrottle cfs tasks. As a result, throttled tasks can't be unthrottled in time even though they have remaining quota.

Signed-off-by: Liangyan
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d90a64620072..fdb03c752f97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4411,9 +4411,11 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
 	if (runtime_refresh_within(cfs_b, min_left))
 		return;

-	hrtimer_start(&cfs_b->slack_timer,
+	if (!hrtimer_active(&cfs_b->slack_timer)) {
+		hrtimer_start(&cfs_b->slack_timer,
 			ns_to_ktime(cfs_bandwidth_slack_period),
 			HRTIMER_MODE_REL);
+	}
 }

 /* we know any runtime found here is valid as update_curr() precedes return */
--
2.14.4.44.g2045bb6