Re: [PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry
On Fri, Jan 08, 2021 at 10:42:47AM +0800, Wen Yang wrote: > > > 在 2021/1/8 上午2:28, Greg Kroah-Hartman 写道: > > On Fri, Jan 08, 2021 at 12:21:38AM +0800, Wen Yang wrote: > > > > > > > > > 在 2021/1/7 下午8:17, Greg Kroah-Hartman 写道: > > > > On Thu, Jan 07, 2021 at 03:52:12PM +0800, Wen Yang wrote: > > > > > The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, > > > > > they > > > > > should be deleted when the process exits. > > > > > > > > > > Suppose the following race appears: > > > > > > > > > > release_task dput > > > > > -> proc_flush_task > > > > >-> dentry->d_op->d_delete(dentry) > > > > > -> __exit_signal > > > > >-> dentry->d_lockref.count-- and > > > > > return. > > > > > > > > > > In the proc_flush_task(), if another process is using this dentry, it > > > > > will > > > > > not be deleted. At the same time, in dput(), d_op->d_delete() can be > > > > > executed > > > > > before __exit_signal(pid has not been hashed), d_delete returns > > > > > false, so > > > > > this dentry still cannot be deleted. > > > > > > > > > > This dentry will always be cached (although its count is 0 and the > > > > > DCACHE_OP_DELETE flag is set), its parent denry will also be cached > > > > > too, and > > > > > these dentries can only be deleted when drop_caches is manually > > > > > triggered. > > > > > > > > > > This will result in wasted memory. What's more troublesome is that > > > > > these > > > > > dentries reference pid, according to the commit f333c700c610 ("pidns: > > > > > Add a > > > > > limit on the number of pid namespaces"), if the pid cannot be > > > > > released, it > > > > > may result in the inability to create a new pid_ns. > > > > > > > > > > This issue was introduced by 60347f6716aa ("pid namespaces: prepare > > > > > proc_flust_task() to flush entries from multiple proc trees"), > > > > > exposed by > > > > > f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), > > > > > and then > > > > > fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from > > > > > proc"). > > > > > > > > Why are you just submitting a series for 4.9 and 4.19, what about 4.14? > > > > We can't have users move to a newer kernel and then experience old bugs, > > > > right? > > > > > > > Okay, the patches corresponding to 4.14 will be ready later. > > > > Note for some reason you didn't cc: the stable list for these patches :( > > > > > > But the larger question is why are you backporting a whole new feature > > > > here? Why is CLONE_PIDFD needed? That feels really wrong... > > > > > > > > > > The reason for backporting CLONE_PIDFD is because 7bc3e6e55acf ("proc: > > > Use a > > > list of inodes to flush from proc") relies on wait_pidfd.lock. There are > > > indeed many associated modifications here. We are also testing it. Please > > > check the code more. > > > > Is the only "issue" here wasted memory? Will it eventually be freed > > anyway even if you do not echo to the proc file to flush caches? > > > > You mention the inability to create a new pid for a specific namespace, > > is that really a problem? Shouldn't the code handle such issues > > normally? What breaks without these changes? > > > > I think at this point, it might just time for you to move to a newer > > kernel release, as adding a whole new userspace feature for this feels > > really really odd. > > > > What is preventing you from doing that today? What holds you to older > > kernels that will not allow you to move forward? > > > > We have encountered this problem in the cloud server environment. Users will > frequently create and delete containers, and the corresponding pid_ns will > accumulate, eventually making it impossible to create a new container. > > https://bugzilla.kernel.org/show_bug.cgi?id=208613 > > The kernels (4.9/4.19) used on a large scale in our current production > environment (almost tens of thousands of machines) may need to be fixed. What prevents you from moving them to 5.4 or better yet, 5.10? You will have to do it soon anyway, I'm sure you have been testing those kernels to validate that all works well with them on a subset of your environment, so for those systems that have this problem, why can't you update the base kernel? thanks, greg k-h
Re: [PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry
在 2021/1/8 上午2:28, Greg Kroah-Hartman 写道: On Fri, Jan 08, 2021 at 12:21:38AM +0800, Wen Yang wrote: 在 2021/1/7 下午8:17, Greg Kroah-Hartman 写道: On Thu, Jan 07, 2021 at 03:52:12PM +0800, Wen Yang wrote: The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they should be deleted when the process exits. Suppose the following race appears: release_task dput -> proc_flush_task -> dentry->d_op->d_delete(dentry) -> __exit_signal -> dentry->d_lockref.count-- and return. In the proc_flush_task(), if another process is using this dentry, it will not be deleted. At the same time, in dput(), d_op->d_delete() can be executed before __exit_signal(pid has not been hashed), d_delete returns false, so this dentry still cannot be deleted. This dentry will always be cached (although its count is 0 and the DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and these dentries can only be deleted when drop_caches is manually triggered. This will result in wasted memory. What's more troublesome is that these dentries reference pid, according to the commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if the pid cannot be released, it may result in the inability to create a new pid_ns. This issue was introduced by 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees"), exposed by f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc"). Why are you just submitting a series for 4.9 and 4.19, what about 4.14? We can't have users move to a newer kernel and then experience old bugs, right? Okay, the patches corresponding to 4.14 will be ready later. Note for some reason you didn't cc: the stable list for these patches :( But the larger question is why are you backporting a whole new feature here? Why is CLONE_PIDFD needed? That feels really wrong... The reason for backporting CLONE_PIDFD is because 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc") relies on wait_pidfd.lock. There are indeed many associated modifications here. We are also testing it. Please check the code more. Is the only "issue" here wasted memory? Will it eventually be freed anyway even if you do not echo to the proc file to flush caches? You mention the inability to create a new pid for a specific namespace, is that really a problem? Shouldn't the code handle such issues normally? What breaks without these changes? I think at this point, it might just time for you to move to a newer kernel release, as adding a whole new userspace feature for this feels really really odd. What is preventing you from doing that today? What holds you to older kernels that will not allow you to move forward? We have encountered this problem in the cloud server environment. Users will frequently create and delete containers, and the corresponding pid_ns will accumulate, eventually making it impossible to create a new container. https://bugzilla.kernel.org/show_bug.cgi?id=208613 The kernels (4.9/4.19) used on a large scale in our current production environment (almost tens of thousands of machines) may need to be fixed. Thanks. -- Best wishes, Wen
Re: [PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry
On Fri, Jan 08, 2021 at 12:21:38AM +0800, Wen Yang wrote: > > > 在 2021/1/7 下午8:17, Greg Kroah-Hartman 写道: > > On Thu, Jan 07, 2021 at 03:52:12PM +0800, Wen Yang wrote: > > > The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they > > > should be deleted when the process exits. > > > > > > Suppose the following race appears: > > > > > > release_task dput > > > -> proc_flush_task > > > -> dentry->d_op->d_delete(dentry) > > > -> __exit_signal > > > -> dentry->d_lockref.count-- and return. > > > > > > In the proc_flush_task(), if another process is using this dentry, it will > > > not be deleted. At the same time, in dput(), d_op->d_delete() can be > > > executed > > > before __exit_signal(pid has not been hashed), d_delete returns false, so > > > this dentry still cannot be deleted. > > > > > > This dentry will always be cached (although its count is 0 and the > > > DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, > > > and > > > these dentries can only be deleted when drop_caches is manually triggered. > > > > > > This will result in wasted memory. What's more troublesome is that these > > > dentries reference pid, according to the commit f333c700c610 ("pidns: Add > > > a > > > limit on the number of pid namespaces"), if the pid cannot be released, it > > > may result in the inability to create a new pid_ns. > > > > > > This issue was introduced by 60347f6716aa ("pid namespaces: prepare > > > proc_flust_task() to flush entries from multiple proc trees"), exposed by > > > f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and > > > then > > > fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc"). > > > > Why are you just submitting a series for 4.9 and 4.19, what about 4.14? > > We can't have users move to a newer kernel and then experience old bugs, > > right? > > > Okay, the patches corresponding to 4.14 will be ready later. Note for some reason you didn't cc: the stable list for these patches :( > > But the larger question is why are you backporting a whole new feature > > here? Why is CLONE_PIDFD needed? That feels really wrong... > > > > The reason for backporting CLONE_PIDFD is because 7bc3e6e55acf ("proc: Use a > list of inodes to flush from proc") relies on wait_pidfd.lock. There are > indeed many associated modifications here. We are also testing it. Please > check the code more. Is the only "issue" here wasted memory? Will it eventually be freed anyway even if you do not echo to the proc file to flush caches? You mention the inability to create a new pid for a specific namespace, is that really a problem? Shouldn't the code handle such issues normally? What breaks without these changes? I think at this point, it might just time for you to move to a newer kernel release, as adding a whole new userspace feature for this feels really really odd. What is preventing you from doing that today? What holds you to older kernels that will not allow you to move forward? thanks, greg k-h
Re: [PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry
在 2021/1/7 下午8:17, Greg Kroah-Hartman 写道: On Thu, Jan 07, 2021 at 03:52:12PM +0800, Wen Yang wrote: The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they should be deleted when the process exits. Suppose the following race appears: release_task dput -> proc_flush_task -> dentry->d_op->d_delete(dentry) -> __exit_signal -> dentry->d_lockref.count-- and return. In the proc_flush_task(), if another process is using this dentry, it will not be deleted. At the same time, in dput(), d_op->d_delete() can be executed before __exit_signal(pid has not been hashed), d_delete returns false, so this dentry still cannot be deleted. This dentry will always be cached (although its count is 0 and the DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and these dentries can only be deleted when drop_caches is manually triggered. This will result in wasted memory. What's more troublesome is that these dentries reference pid, according to the commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if the pid cannot be released, it may result in the inability to create a new pid_ns. This issue was introduced by 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees"), exposed by f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc"). Why are you just submitting a series for 4.9 and 4.19, what about 4.14? We can't have users move to a newer kernel and then experience old bugs, right? Okay, the patches corresponding to 4.14 will be ready later. But the larger question is why are you backporting a whole new feature here? Why is CLONE_PIDFD needed? That feels really wrong... The reason for backporting CLONE_PIDFD is because 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc") relies on wait_pidfd.lock. There are indeed many associated modifications here. We are also testing it. Please check the code more. Thanks. -- Best wishes, Wen
Re: [PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry
On Thu, Jan 07, 2021 at 03:52:12PM +0800, Wen Yang wrote: > The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they > should be deleted when the process exits. > > Suppose the following race appears: > > release_task dput > -> proc_flush_task > -> dentry->d_op->d_delete(dentry) > -> __exit_signal > -> dentry->d_lockref.count-- and return. > > In the proc_flush_task(), if another process is using this dentry, it will > not be deleted. At the same time, in dput(), d_op->d_delete() can be executed > before __exit_signal(pid has not been hashed), d_delete returns false, so > this dentry still cannot be deleted. > > This dentry will always be cached (although its count is 0 and the > DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and > these dentries can only be deleted when drop_caches is manually triggered. > > This will result in wasted memory. What's more troublesome is that these > dentries reference pid, according to the commit f333c700c610 ("pidns: Add a > limit on the number of pid namespaces"), if the pid cannot be released, it > may result in the inability to create a new pid_ns. > > This issue was introduced by 60347f6716aa ("pid namespaces: prepare > proc_flust_task() to flush entries from multiple proc trees"), exposed by > f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then > fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc"). Why are you just submitting a series for 4.9 and 4.19, what about 4.14? We can't have users move to a newer kernel and then experience old bugs, right? But the larger question is why are you backporting a whole new feature here? Why is CLONE_PIDFD needed? That feels really wrong... thanks, greg k-h
[PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry
The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they should be deleted when the process exits. Suppose the following race appears: release_task dput -> proc_flush_task -> dentry->d_op->d_delete(dentry) -> __exit_signal -> dentry->d_lockref.count-- and return. In the proc_flush_task(), if another process is using this dentry, it will not be deleted. At the same time, in dput(), d_op->d_delete() can be executed before __exit_signal(pid has not been hashed), d_delete returns false, so this dentry still cannot be deleted. This dentry will always be cached (although its count is 0 and the DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and these dentries can only be deleted when drop_caches is manually triggered. This will result in wasted memory. What's more troublesome is that these dentries reference pid, according to the commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if the pid cannot be released, it may result in the inability to create a new pid_ns. This issue was introduced by 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees"), exposed by f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc"). Alexey Dobriyan (1): proc: use %u for pid printing and slightly less stack Andreas Gruenbacher (1): proc: Pass file mode to proc_pid_make_inode Christian Brauner (1): clone: add CLONE_PIDFD Eric W. Biederman (6): proc: Better ownership of files for non-dumpable tasks in user namespaces proc: Rename in proc_inode rename sysctl_inodes sibling_inodes proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache proc: Clear the pieces of proc_inode that proc_evict_inode cares about proc: Use d_invalidate in proc_prune_siblings_dcache proc: Use a list of inodes to flush from proc Joel Fernandes (Google) (1): pidfd: add polling support fs/proc/base.c | 242 - fs/proc/fd.c | 20 +--- fs/proc/inode.c| 67 - fs/proc/internal.h | 22 ++--- fs/proc/namespaces.c | 3 +- fs/proc/proc_sysctl.c | 45 ++--- fs/proc/self.c | 6 +- fs/proc/thread_self.c | 5 +- include/linux/pid.h| 5 + include/linux/proc_fs.h| 4 +- include/uapi/linux/sched.h | 1 + kernel/exit.c | 5 +- kernel/fork.c | 131 +++- kernel/pid.c | 3 + kernel/signal.c| 11 +++ security/selinux/hooks.c | 1 + 16 files changed, 343 insertions(+), 228 deletions(-) -- 1.8.3.1
[PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry
The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they should be deleted when the process exits. Suppose the following race appears: release_task dput -> proc_flush_task -> dentry->d_op->d_delete(dentry) -> __exit_signal -> dentry->d_lockref.count-- and return. In the proc_flush_task(), if another process is using this dentry, it will not be deleted. At the same time, in dput(), d_op->d_delete() can be executed before __exit_signal(pid has not been hashed), d_delete returns false, so this dentry still cannot be deleted. This dentry will always be cached (although its count is 0 and the DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and these dentries can only be deleted when drop_caches is manually triggered. This will result in wasted memory. What's more troublesome is that these dentries reference pid, according to the commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if the pid cannot be released, it may result in the inability to create a new pid_ns. This issue was introduced by 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees"), exposed by f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc"). Alexey Dobriyan (1): proc: use %u for pid printing and slightly less stack Andreas Gruenbacher (1): proc: Pass file mode to proc_pid_make_inode Christian Brauner (1): clone: add CLONE_PIDFD Eric W. Biederman (6): proc: Better ownership of files for non-dumpable tasks in user namespaces proc: Rename in proc_inode rename sysctl_inodes sibling_inodes proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache proc: Clear the pieces of proc_inode that proc_evict_inode cares about proc: Use d_invalidate in proc_prune_siblings_dcache proc: Use a list of inodes to flush from proc Joel Fernandes (Google) (1): pidfd: add polling support fs/proc/base.c | 242 - fs/proc/fd.c | 20 +--- fs/proc/inode.c| 67 - fs/proc/internal.h | 22 ++--- fs/proc/namespaces.c | 3 +- fs/proc/proc_sysctl.c | 45 ++--- fs/proc/self.c | 6 +- fs/proc/thread_self.c | 5 +- include/linux/pid.h| 5 + include/linux/proc_fs.h| 4 +- include/uapi/linux/sched.h | 1 + kernel/exit.c | 5 +- kernel/fork.c | 131 +++- kernel/pid.c | 3 + kernel/signal.c| 11 +++ security/selinux/hooks.c | 1 + 16 files changed, 343 insertions(+), 228 deletions(-) -- 1.8.3.1