Re: [PATCH] net: sched: sch_teql: fix null-pointer dereference
On 4/8/21 6:26 PM, Eric Dumazet wrote: On 4/8/21 5:14 PM, Pavel Tikhomirov wrote: Reproduce: modprobe sch_teql tc qdisc add dev teql0 root teql0 This leads to (for instance in Centos 7 VM) OOPS: Null pointer dereference happens on master->slaves dereference in teql_destroy() as master is null-pointer. When qdisc_create() calls teql_qdisc_init() it imediately fails after check "if (m->dev == dev)" because both devices are teql0, and it does not set qdisc_priv(sch)->m leaving it zero on error path, then qdisc_create() imediately calls teql_destroy() which does not expect zero master pointer and we get OOPS. Signed-off-by: Pavel Tikhomirov --- This makes sense, thanks ! Reviewed-by: Eric Dumazet Thanks! I would think bug origin is Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation") Can you confirm you have this backported to 3.10.0-1062.7.1.el7.x86_64 ? According to our source copy it looks backported to 1062.7.1, please see: https://src.openvz.org/projects/OVZ/repos/vzkernel/browse/net/sched/sch_api.c?at=refs%2Ftags%2Frh7-3.10.0-1062.7.1.el7#1167 -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
[PATCH] net: sched: sch_teql: fix null-pointer dereference
Reproduce: modprobe sch_teql tc qdisc add dev teql0 root teql0 This leads to (for instance in Centos 7 VM) OOPS: [ 532.366633] BUG: unable to handle kernel NULL pointer dereference at 00a8 [ 532.366733] IP: [] teql_destroy+0x18/0x100 [sch_teql] [ 532.366825] PGD 8001376d5067 PUD 137e37067 PMD 0 [ 532.366906] Oops: [#1] SMP [ 532.366987] Modules linked in: sch_teql ... [ 532.367945] CPU: 1 PID: 3026 Comm: tc Kdump: loaded Tainted: G T 3.10.0-1062.7.1.el7.x86_64 #1 [ 532.368041] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014 [ 532.368125] task: 8b7d37d31070 ti: 8b7c9fdbc000 task.ti: 8b7c9fdbc000 [ 532.368224] RIP: 0010:[] [] teql_destroy+0x18/0x100 [sch_teql] [ 532.368320] RSP: 0018:8b7c9fdbf8e0 EFLAGS: 00010286 [ 532.368394] RAX: c0612490 RBX: 8b7cb1565e00 RCX: 8b7d35ba2000 [ 532.368476] RDX: 8b7d35ba2000 RSI: RDI: 8b7cb1565e00 [ 532.368557] RBP: 8b7c9fdbf8f8 R08: 8b7d3fd1f140 R09: 8b7d3b001600 [ 532.368638] R10: 8b7d3b001600 R11: 84c7d65b R12: ffd8 [ 532.368719] R13: 8000 R14: 8b7d35ba2000 R15: 8b7c9fdbf9a8 [ 532.368800] FS: 7f6a4e872740() GS:8b7d3fd0() knlGS: [ 532.368885] CS: 0010 DS: ES: CR0: 80050033 [ 532.368961] CR2: 00a8 CR3: 0001396ee000 CR4: 000206e0 [ 532.369046] Call Trace: [ 532.369159] [] qdisc_create+0x36e/0x450 [ 532.369268] [] ? ns_capable+0x29/0x50 [ 532.369366] [] ? nla_parse+0x32/0x120 [ 532.369442] [] tc_modify_qdisc+0x13c/0x610 [ 532.371508] [] rtnetlink_rcv_msg+0xa7/0x260 [ 532.372668] [] ? sock_has_perm+0x75/0x90 [ 532.373790] [] ? rtnl_newlink+0x890/0x890 [ 532.374914] [] netlink_rcv_skb+0xab/0xc0 [ 532.376055] [] rtnetlink_rcv+0x28/0x30 [ 532.377204] [] netlink_unicast+0x170/0x210 [ 532.378333] [] netlink_sendmsg+0x308/0x420 [ 532.379465] [] sock_sendmsg+0xb6/0xf0 [ 532.380710] [] ? __xfs_filemap_fault+0x8e/0x1d0 [xfs] [ 532.381868] [] ? xfs_filemap_fault+0x2c/0x30 [xfs] [ 532.383037] [] ? __do_fault.isra.61+0x8a/0x100 [ 532.384144] [] ___sys_sendmsg+0x3e9/0x400 [ 532.385268] [] ? handle_mm_fault+0x39d/0x9b0 [ 532.386387] [] ? __do_page_fault+0x238/0x500 [ 532.387472] [] __sys_sendmsg+0x51/0x90 [ 532.388560] [] SyS_sendmsg+0x12/0x20 [ 532.389636] [] system_call_fastpath+0x25/0x2a [ 532.390704] [] ? system_call_after_swapgs+0xae/0x146 [ 532.391753] Code: 00 00 00 00 00 00 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b b7 48 01 00 00 48 89 fb <48> 8b 8e a8 00 00 00 48 85 c9 74 43 48 89 ca eb 0f 0f 1f 80 00 [ 532.394036] RIP [] teql_destroy+0x18/0x100 [sch_teql] [ 532.395127] RSP [ 532.396179] CR2: 00a8 Null pointer dereference happens on master->slaves dereference in teql_destroy() as master is null-pointer. When qdisc_create() calls teql_qdisc_init() it imediately fails after check "if (m->dev == dev)" because both devices are teql0, and it does not set qdisc_priv(sch)->m leaving it zero on error path, then qdisc_create() imediately calls teql_destroy() which does not expect zero master pointer and we get OOPS. Signed-off-by: Pavel Tikhomirov --- net/sched/sch_teql.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c index 2f1f0a378408..6af6b95bdb67 100644 --- a/net/sched/sch_teql.c +++ b/net/sched/sch_teql.c @@ -134,6 +134,9 @@ teql_destroy(struct Qdisc *sch) struct teql_sched_data *dat = qdisc_priv(sch); struct teql_master *master = dat->m; + if (!master) + return; + prev = master->slaves; if (prev) { do { -- 2.30.2
[PATCH] clone3: add option to change owner of newly created namespaces
Let's add a flag CLONE_OWNER_NS and clone_args.userns_fd field to specify a user namespace for clone3 which would become an owner of newly created namespaces. This owner is restricted to be a descendant of current user namespace. It means that we can do clone while in more privileged user namespace than the one which would become an owner. We need this for CRIU. 1) When CRIU is dumping a container it can face nested user namespaces and pid namespaces, to properly dump/restore them CRIU needs to also restore all dependencies between namespaces: parent-child and userns-owner. Previously when CRIU was recreating the process tree of container during restore if we needed to restore a pid namespace init we should first enter a proper user namespace which should be an owner of restored pid namespace and only than can clone. That means that the cloned process is initially created in probably unprivileged user namespace and it brings a lot of restrictions on what it can do (e.g. we should probably enter it's ipc/uts/net namespaces already, if they are owned by more privileged user namespace we would not be able to enter them later). With new userns_fd option we would be able to recreate process tree with all container pid namespaces with proper user namespace owners for them while all processes are left in most privileged user namespace, and it would be easier to restore all resources of those processes. We can restore user namespace of each process later when needed. 2) Other problem which this option is trying to solve is that clone3() with set_tid does not work as desired when we try to recreate a process in the container which has nested user and pid namespaces. Imagine that in container we have a chain of processes each of them was created with clone() with (CLONE_NEWPID | CLONE_NEWUSER) from previous process and in each pid namespace of this chain we've also created some amount of processes to hold pid numbers. Next we create one more "target" process with new pid and user namespaces in the end of chain, it can have random pids in each pid namespace of the chain. When CRIU would restore this container it would not be able to restore pids of "target" process on each pid namespace level with clone3()+ set_tid because it would need to call clone from owner user namespace (or it's parent if we use CLONE_NEWUSER) of the "target"'s pid namespace, which has no rights to set_tid on each needed level. With new userns_fd option we would easily do it, we just need to have current user namespace to be root user namespace of the container and pass "target"'s pid namespace owner user namespace to userns_fd. Here are two examples on the use of new userns_fd option: - clone3_owner_ns.c - is simple demonstration of how process can create pid namespace owned by it's user namespace descendant; - clone3_set_tid_vs_owner_ns.c - is a bit more complex demonstration on how clone3+set_tid can work for restoring pids on each level of nested user and pid namespaces when used together with userns_fd: https://github.com/Snorch/clone3_owner_ns Signed-off-by: Pavel Tikhomirov --- include/linux/nsproxy.h| 3 ++- include/linux/sched/task.h | 1 + include/linux/user_namespace.h | 6 ++ include/uapi/linux/sched.h | 3 +++ kernel/fork.c | 18 +++--- kernel/nsproxy.c | 19 ++- kernel/user_namespace.c| 22 ++ 7 files changed, 67 insertions(+), 5 deletions(-) diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h index cdb171efc7cb..201bbb75637d 100644 --- a/include/linux/nsproxy.h +++ b/include/linux/nsproxy.h @@ -91,7 +91,8 @@ static inline struct cred *nsset_cred(struct nsset *set) * */ -int copy_namespaces(unsigned long flags, struct task_struct *tsk); +int copy_namespaces(unsigned long flags, struct task_struct *tsk, + int userns_fd); void exit_task_namespaces(struct task_struct *tsk); void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new); void free_nsproxy(struct nsproxy *ns); diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index c0f71f2e7160..176b087443a0 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -33,6 +33,7 @@ struct kernel_clone_args { int cgroup; struct cgroup *cgrp; struct css_set *cset; + int userns_fd; }; /* diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 64cf8ebdc4ec..cecc4c55b7cb 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -136,6 +136,7 @@ extern bool in_userns(const struct user_namespace *ancestor, const struct user_namespace *child); extern bool current_in_userns(const struct user_namespace *target_ns); struct ns_common *ns_get_owner(struct ns_common *ns); +extern struct user_namespac
Re: [PATCH] move_mount: allow to add a mount into an existing group
On 3/29/21 12:47 AM, Andrei Vagin wrote: On Thu, Mar 25, 2021 at 03:14:44PM +0300, Pavel Tikhomirov wrote: Previously a sharing group (shared and master ids pair) can be only inherited when mount is created via bindmount. This patch adds an ability to add an existing private mount into an existing sharing group. With this functionality one can first create the desired mount tree from only private mounts (without the need to care about undesired mount propagation or mount creation order implied by sharing group dependencies), and next then setup any desired mount sharing between those mounts in tree as needed. This allows CRIU to restore any set of mount namespaces, mount trees and sharing group trees for a container. We have many issues with restoring mounts in CRIU related to sharing groups and propagation: - reverse sharing groups vs mount tree order requires complex mounts reordering which mostly implies also using some temporary mounts (please see https://lkml.org/lkml/2021/3/23/569 for more info) - mount() syscall creates tons of mounts due to propagation - mount re-parenting due to propagation - "Mount Trap" due to propagation - "Non Uniform" propagation, meaning that with different tricks with mount order and temporary children-"lock" mounts one can create mount trees which can't be restored without those tricks (see https://www.linuxplumbersconf.org/event/7/contributions/640/) With this new functionality we can resolve all the problems with propagation at once. Thanks for picking this up. Overall it looks good for me. Here is one comment inline. Cc: Eric W. Biederman Cc: Alexander Viro Cc: Andrei Vagin Cc: linux-fsde...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: lkml Signed-off-by: Pavel Tikhomirov --- This is a rework of "mnt: allow to add a mount into an existing group" patch from Andrei. https://lkml.org/lkml/2017/4/28/20 New do_set_group is similar to do_move_mount, but with many restrictions of do_move_mount removed and that's why: 1) Allow "cross-namespace" sharing group set. If we allow operation only with mounts from current+anon mount namespace one would still be able to setns(from_mntns) + open_tree(from, OPEN_TREE_CLONE) + setns(to_mntns) + move_mount(anon, to, MOVE_MOUNT_SET_GROUP) to set sharing group to mount in different mount namespace with source mount. But with this approach we would need to create anon mount namespace and mount copy each time, which is just a waste of resources. So instead lets just check if we are allowed to modify both mount namespaces (which looks equivalent to what setns-es and open_tree check). 2) Allow operating on non-root dentry of the mount. As if we prohibit it this would require extra care from CRIU side in places where we wan't to copy sharing group from mount on host (for external mounts) and user gives us path to non-root dentry. I don't see any problem with referencing mount with any dentry for sharing group setting. Also there is no problem with referencing one by file and one by directory. 3) Also checks wich only apply to actually moving mount which we have in do_move_mount and open_tree are skipped. We don't need to check MNT_LOCKED, unbindable, nsfs loops and ancestor relation as we don't move mounts. Security note: there would be no (new) loops in sharing groups tree, because this new move_mount(MOVE_MOUNT_SET_GROUP) operation only adds one _private_ mount to one group (without moving between groups), the sharing groups tree itself stays unchanged after it. In Virtuozzo we have "mount-v2" implementation, based with the original kernel patch from Andrei, tested for almost a year and it actually decreased number of bugs with mounts a lot. One can take a look on the implementation of sharing group restore in CRIU in "mount-v2" here: https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#898 This works almost the same with current version of patch if we replace mount(MS_SET_GROUP) to move_mount(MOVE_MOUNT_SET_GROUP), please see super-draft port for mainstream criu, this at least passes non-user-namespaced mount tests (zdtm.py --mounts-v2 -f ns). https://github.com/Snorch/criu/commits/mount-v2-poc --- fs/namespace.c | 57 +- include/uapi/linux/mount.h | 3 +- 2 files changed, 58 insertions(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 9d33909d0f9e..ab439d8510dd 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2660,6 +2660,58 @@ static bool check_for_nsfs_mounts(struct mount *subtree) return ret; } +static int do_set_group(struct path *from_path, struct path *to_path) +{ + struct mount *from, *to; + int err; + + from = real_mount(from_path->mnt); + to = real_mount(to_path->mnt); + + namespace_lock(); + + err = -EINVAL; + /* To and From must be
[PATCH] move_mount: allow to add a mount into an existing group
Previously a sharing group (shared and master ids pair) can be only inherited when mount is created via bindmount. This patch adds an ability to add an existing private mount into an existing sharing group. With this functionality one can first create the desired mount tree from only private mounts (without the need to care about undesired mount propagation or mount creation order implied by sharing group dependencies), and next then setup any desired mount sharing between those mounts in tree as needed. This allows CRIU to restore any set of mount namespaces, mount trees and sharing group trees for a container. We have many issues with restoring mounts in CRIU related to sharing groups and propagation: - reverse sharing groups vs mount tree order requires complex mounts reordering which mostly implies also using some temporary mounts (please see https://lkml.org/lkml/2021/3/23/569 for more info) - mount() syscall creates tons of mounts due to propagation - mount re-parenting due to propagation - "Mount Trap" due to propagation - "Non Uniform" propagation, meaning that with different tricks with mount order and temporary children-"lock" mounts one can create mount trees which can't be restored without those tricks (see https://www.linuxplumbersconf.org/event/7/contributions/640/) With this new functionality we can resolve all the problems with propagation at once. Cc: Eric W. Biederman Cc: Alexander Viro Cc: Andrei Vagin Cc: linux-fsde...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: lkml Signed-off-by: Pavel Tikhomirov --- This is a rework of "mnt: allow to add a mount into an existing group" patch from Andrei. https://lkml.org/lkml/2017/4/28/20 New do_set_group is similar to do_move_mount, but with many restrictions of do_move_mount removed and that's why: 1) Allow "cross-namespace" sharing group set. If we allow operation only with mounts from current+anon mount namespace one would still be able to setns(from_mntns) + open_tree(from, OPEN_TREE_CLONE) + setns(to_mntns) + move_mount(anon, to, MOVE_MOUNT_SET_GROUP) to set sharing group to mount in different mount namespace with source mount. But with this approach we would need to create anon mount namespace and mount copy each time, which is just a waste of resources. So instead lets just check if we are allowed to modify both mount namespaces (which looks equivalent to what setns-es and open_tree check). 2) Allow operating on non-root dentry of the mount. As if we prohibit it this would require extra care from CRIU side in places where we wan't to copy sharing group from mount on host (for external mounts) and user gives us path to non-root dentry. I don't see any problem with referencing mount with any dentry for sharing group setting. Also there is no problem with referencing one by file and one by directory. 3) Also checks wich only apply to actually moving mount which we have in do_move_mount and open_tree are skipped. We don't need to check MNT_LOCKED, unbindable, nsfs loops and ancestor relation as we don't move mounts. Security note: there would be no (new) loops in sharing groups tree, because this new move_mount(MOVE_MOUNT_SET_GROUP) operation only adds one _private_ mount to one group (without moving between groups), the sharing groups tree itself stays unchanged after it. In Virtuozzo we have "mount-v2" implementation, based with the original kernel patch from Andrei, tested for almost a year and it actually decreased number of bugs with mounts a lot. One can take a look on the implementation of sharing group restore in CRIU in "mount-v2" here: https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#898 This works almost the same with current version of patch if we replace mount(MS_SET_GROUP) to move_mount(MOVE_MOUNT_SET_GROUP), please see super-draft port for mainstream criu, this at least passes non-user-namespaced mount tests (zdtm.py --mounts-v2 -f ns). https://github.com/Snorch/criu/commits/mount-v2-poc --- fs/namespace.c | 57 +- include/uapi/linux/mount.h | 3 +- 2 files changed, 58 insertions(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 9d33909d0f9e..ab439d8510dd 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2660,6 +2660,58 @@ static bool check_for_nsfs_mounts(struct mount *subtree) return ret; } +static int do_set_group(struct path *from_path, struct path *to_path) +{ + struct mount *from, *to; + int err; + + from = real_mount(from_path->mnt); + to = real_mount(to_path->mnt); + + namespace_lock(); + + err = -EINVAL; + /* To and From must be mounted */ + if (!is_mounted(&from->mnt)) + goto out; + if (!is_mounted(&to->mnt)) + goto out; + + err = -EPERM; + /* We should be allowed to modif
Re: [CRIU] [PATCH] mnt: allow to add a mount into an existing group
Adding Andrew to CC with the right email. On 3/23/21 3:59 PM, Pavel Tikhomirov wrote: Hi! Can we restart the discussion on this topic? In CRIU we need to be able to dump/restore all mount trees of system container (CT). CT can have anything inside - users which create their custom mounts configuration, systemd with custom mount namespaces for it's services, nested application containers inside the CT with their own mount namespaces, and all mounts in CT mount trees can be grouped by sharing groupes (e.g. same shared_id + master_id pair), and those groups can depend one from another forming a tree structure of sharing groups. 1) Imagine that we have this sharing group tree (in format (shared_id, master_id), 0 means no sharing, we don't care about actual mounts for now only master-slave dependencies between sharing groups): (1,0) |- (2,1) |- (3,1) |- (4,3) |- (0,4) The main problem of restoring mounts is the fact that sharing groups currently can be only inherited, e.g. if you have one mount (first) with shared_id = x, master_id = y, the only way to get another mount with (x,y) is to create a bindmount from the first mount. Also to create mount (y,z) from mount (x,y) one should also first inherit (x,y) via bindmount and than change to (y,z). This means that mentioned above tree puts restriction on the mounts creation order, one need to have at least one mount for each of sharing groups (1,0), (3,1) and (4,3) before creating the first mount of the sharing group (0,4). But what if we want to mount (restore) actual mounts in this mount tree "reverse" order: mntid parent mountpoint (shared_id, master_id) 101 0 /tmp (0,4) 102 101 /tmp (4,3) 103 102 /tmp (3,1) 104 103 /tmp (1,0) Mount 104's sharing group should be created before mount 101, 102 and 103 sharing groups, but mount 104 should be created after those mounts. One can actually prepare this setup (on mainstream kernel) by pre-creating sharing groups elsewhere and then binding to /tmp in proper order with careful unmounting of propagations (see test.sh attached): [root@snorch propagation-tests]# bash ../test.sh 960 1120 0:56 / /tmp/propagation-tests/tmp rw,relatime master:452 - tmpfs propagation-tests-src rw,inode64 958 960 0:56 / /tmp/propagation-tests/tmp/sub rw,relatime shared:452 master:451 - tmpfs propagation-tests-src rw,inode64 961 958 0:56 / /tmp/propagation-tests/tmp/sub/sub rw,relatime shared:451 master:433 - tmpfs propagation-tests-src rw,inode64 963 961 0:56 / /tmp/propagation-tests/tmp/sub/sub/sub rw,relatime shared:433 - tmpfs propagation-tests-src rw,inode64 But this "pre-creating" from test.sh is not universal at all and only works for this simple case. CRIU does not know anything about the history of mount creation for system container, it also does not know anything about any temporary mounts which were used and then removed. So understanding the proper order is almost impossible like Andrew says. I've also prepared a presentation on Linux Plumbers last year about how much problems propagation brings to mounts restore in CRIU, you can take a look here https://www.linuxplumbersconf.org/event/7/contributions/640/ 2) Propagation creates tons of mounts 3) Mount reparenting 4) "Mount trap" 5) "Non-uniform" propagation 6) “Cross-namespace” sharing groups Allowing to create mounts private first and create sharing groups later and copy sharing groups later instead of inheriting them resolves all the problems with propagation at once. One can take a look on the implementation of sharing group restore in CRIU if we have this (mnt: allow to add a mount into an existing group) patch applied: https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L898 Obviously this does not solve all the problems with mounts I know about but it's a big step forward in properly supporting them in CRIU. We already have this tested in Virtuozzo for almost a year and it works nice. Notes: - There is another idea, but I should say early that I don't like it, because with it restoring mounts with criu would be still super complex. We can add extra flag to mount/move_mount syscall to disable propagation temporary so that CRIU can restore the mount tree without problems 2-5, also we can now create cross-namespace bindmounts with (copy_tree+move_mount) to solve 6. But this solution does not help much with problem 1 - ordering and the need of temporary mounts. As you can see in test.sh you would still need to think hard to solve different similar configurations of reverse order between mounts and sharing groups. - We can actually prohibit cross-namespace MS_SET_GROUP if you like. (If both namespaces are non abstract.) We can use open_tree to create a copy of the mount with
Re: [CRIU] [PATCH] mnt: allow to add a mount into an existing group
Hi! Can we restart the discussion on this topic? In CRIU we need to be able to dump/restore all mount trees of system container (CT). CT can have anything inside - users which create their custom mounts configuration, systemd with custom mount namespaces for it's services, nested application containers inside the CT with their own mount namespaces, and all mounts in CT mount trees can be grouped by sharing groupes (e.g. same shared_id + master_id pair), and those groups can depend one from another forming a tree structure of sharing groups. 1) Imagine that we have this sharing group tree (in format (shared_id, master_id), 0 means no sharing, we don't care about actual mounts for now only master-slave dependencies between sharing groups): (1,0) |- (2,1) |- (3,1) |- (4,3) |- (0,4) The main problem of restoring mounts is the fact that sharing groups currently can be only inherited, e.g. if you have one mount (first) with shared_id = x, master_id = y, the only way to get another mount with (x,y) is to create a bindmount from the first mount. Also to create mount (y,z) from mount (x,y) one should also first inherit (x,y) via bindmount and than change to (y,z). This means that mentioned above tree puts restriction on the mounts creation order, one need to have at least one mount for each of sharing groups (1,0), (3,1) and (4,3) before creating the first mount of the sharing group (0,4). But what if we want to mount (restore) actual mounts in this mount tree "reverse" order: mntid parent mountpoint (shared_id, master_id) 101 0 /tmp(0,4) 102 101 /tmp(4,3) 103 102 /tmp(3,1) 104 103 /tmp(1,0) Mount 104's sharing group should be created before mount 101, 102 and 103 sharing groups, but mount 104 should be created after those mounts. One can actually prepare this setup (on mainstream kernel) by pre-creating sharing groups elsewhere and then binding to /tmp in proper order with careful unmounting of propagations (see test.sh attached): [root@snorch propagation-tests]# bash ../test.sh 960 1120 0:56 / /tmp/propagation-tests/tmp rw,relatime master:452 - tmpfs propagation-tests-src rw,inode64 958 960 0:56 / /tmp/propagation-tests/tmp/sub rw,relatime shared:452 master:451 - tmpfs propagation-tests-src rw,inode64 961 958 0:56 / /tmp/propagation-tests/tmp/sub/sub rw,relatime shared:451 master:433 - tmpfs propagation-tests-src rw,inode64 963 961 0:56 / /tmp/propagation-tests/tmp/sub/sub/sub rw,relatime shared:433 - tmpfs propagation-tests-src rw,inode64 But this "pre-creating" from test.sh is not universal at all and only works for this simple case. CRIU does not know anything about the history of mount creation for system container, it also does not know anything about any temporary mounts which were used and then removed. So understanding the proper order is almost impossible like Andrew says. I've also prepared a presentation on Linux Plumbers last year about how much problems propagation brings to mounts restore in CRIU, you can take a look here https://www.linuxplumbersconf.org/event/7/contributions/640/ 2) Propagation creates tons of mounts 3) Mount reparenting 4) "Mount trap" 5) "Non-uniform" propagation 6) “Cross-namespace” sharing groups Allowing to create mounts private first and create sharing groups later and copy sharing groups later instead of inheriting them resolves all the problems with propagation at once. One can take a look on the implementation of sharing group restore in CRIU if we have this (mnt: allow to add a mount into an existing group) patch applied: https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L898 Obviously this does not solve all the problems with mounts I know about but it's a big step forward in properly supporting them in CRIU. We already have this tested in Virtuozzo for almost a year and it works nice. Notes: - There is another idea, but I should say early that I don't like it, because with it restoring mounts with criu would be still super complex. We can add extra flag to mount/move_mount syscall to disable propagation temporary so that CRIU can restore the mount tree without problems 2-5, also we can now create cross-namespace bindmounts with (copy_tree+move_mount) to solve 6. But this solution does not help much with problem 1 - ordering and the need of temporary mounts. As you can see in test.sh you would still need to think hard to solve different similar configurations of reverse order between mounts and sharing groups. - We can actually prohibit cross-namespace MS_SET_GROUP if you like. (If both namespaces are non abstract.) We can use open_tree to create a copy of the mount with the same sharing group and only then copy sharing from the copy while being in proper mountns. - We still need it: > this code might be made
Re: [PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task
On 2/8/21 3:31 PM, Jeff Layton wrote: On Thu, 2021-02-04 at 01:17 +0300, Cyrill Gorcunov wrote: On Thu, Feb 04, 2021 at 12:35:42AM +0300, Pavel Tikhomirov wrote: AFAICS if pid is held only by 1) fowner refcount and by 2) single process (without threads, group and session for simplicity), on process exit we go through: do_exit exit_notify release_task __exit_signal __unhash_process detach_pid __change_pid free_pid idr_remove So pid is removed from idr, and after that alloc_pid can reuse pid numbers even if old pid structure is still alive and is still held by fowner. ... Hope this answers your question, Thanks! Yeah, indeed, thanks! So the change is sane still I'm a bit worried about backward compatibility, gimme some time I'll try to refresh my memory first, in a couple of days or weekend (though here are a number of experienced developers CC'ed maybe they reply even faster). I always find it helpful to refer to the POSIX spec [1] for this sort of thing. In this case, it says: F_GETOWN If fildes refers to a socket, get the process ID or process group ID specified to receive SIGURG signals when out-of-band data is available. Positive values shall indicate a process ID; negative values, other than -1, shall indicate a process group ID; the value zero shall indicate that no SIGURG signals are to be sent. If fildes does not refer to a socket, the results are unspecified. In the event that the PID is reused, the kernel won't send signals to the replacement task, correct? Correct. Looks like only places to send signal to owner are send_sigio() and send_sigurg() (at least nobody else dereferences fown->pid_type). And in both places we lookup for task to send signal to with pid_task() or do_each_pid_task() (similar to what I do in patch) and will not find any task if pid was reused. Thus no signal would be sent. Assuming that's the case, then this patch looks fine to me too. I'll plan to pick it for linux-next later today, and we can hopefully get this into v5.12. [1]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html#tag_16_122 Thanks for finding it! -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task
On 2/3/21 10:32 PM, Cyrill Gorcunov wrote: On Wed, Feb 03, 2021 at 03:41:56PM +0300, Pavel Tikhomirov wrote: Currently there is no way to differentiate the file with alive owner from the file with dead owner but pid of the owner reused. That's why CRIU can't actually know if it needs to restore file owner or not, because if it restores owner but actual owner was dead, this can introduce unexpected signals to the "false"-owner (which reused the pid). Hi! Thanks for the patch. You know I manage to forget the fowner internals. Could you please enlighten me -- when owner is set with some pid we do f_setown_ex __f_setown f_modown filp->f_owner.pid = get_pid(pid); Thus pid get refcount incremented. Hi, and yes you are right about refcount is held. Then the owner exits but refcounter should be still up and running and pid should not be reused, no? Or I miss something obvious? AFAICS if pid is held only by 1) fowner refcount and by 2) single process (without threads, group and session for simplicity), on process exit we go through: do_exit exit_notify release_task __exit_signal __unhash_process detach_pid __change_pid free_pid idr_remove So pid is removed from idr, and after that alloc_pid can reuse pid numbers even if old pid structure is still alive and is still held by fowner. Also I've added criu-zdtm test which reproduces the problem: https://src.openvz.org/projects/OVZ/repos/criu/commits/e25904c35dbc535f6837e55da58ca0f5a5caf4b3#test/zdtm/static/file_fown_reuse.c Hope this answers your question, Thanks! The patch itself looks ok on a first glance. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
[PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task
Currently there is no way to differentiate the file with alive owner from the file with dead owner but pid of the owner reused. That's why CRIU can't actually know if it needs to restore file owner or not, because if it restores owner but actual owner was dead, this can introduce unexpected signals to the "false"-owner (which reused the pid). Let's change the api, so that F_GETOWN(EX) returns 0 in case actual owner is dead already. Cc: Jeff Layton Cc: "J. Bruce Fields" Cc: Alexander Viro Cc: linux-fsde...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Cyrill Gorcunov Cc: Andrei Vagin Signed-off-by: Pavel Tikhomirov --- fs/fcntl.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/fs/fcntl.c b/fs/fcntl.c index 05b36b28f2e8..483ef8861376 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -148,11 +148,15 @@ void f_delown(struct file *filp) pid_t f_getown(struct file *filp) { - pid_t pid; + pid_t pid = 0; read_lock(&filp->f_owner.lock); - pid = pid_vnr(filp->f_owner.pid); - if (filp->f_owner.pid_type == PIDTYPE_PGID) - pid = -pid; + rcu_read_lock(); + if (pid_task(filp->f_owner.pid, filp->f_owner.pid_type)) { + pid = pid_vnr(filp->f_owner.pid); + if (filp->f_owner.pid_type == PIDTYPE_PGID) + pid = -pid; + } + rcu_read_unlock(); read_unlock(&filp->f_owner.lock); return pid; } @@ -200,11 +204,14 @@ static int f_setown_ex(struct file *filp, unsigned long arg) static int f_getown_ex(struct file *filp, unsigned long arg) { struct f_owner_ex __user *owner_p = (void __user *)arg; - struct f_owner_ex owner; + struct f_owner_ex owner = {}; int ret = 0; read_lock(&filp->f_owner.lock); - owner.pid = pid_vnr(filp->f_owner.pid); + rcu_read_lock(); + if (pid_task(filp->f_owner.pid, filp->f_owner.pid_type)) + owner.pid = pid_vnr(filp->f_owner.pid); + rcu_read_unlock(); switch (filp->f_owner.pid_type) { case PIDTYPE_PID: owner.type = F_OWNER_TID; -- 2.26.2
[PATCH v5 2/2] ovl: introduce new "uuid=off" option for inodes index feature
This replaces uuid with null in overlayfs file handles and thus relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are unique. In other case when we have multiple filesystems lets just fallback to "uuid=on" which is and equivalent of how it worked before with all uuid checks. This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when create the copy of container and we require fs uuid to be unique for a new container. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- v2: in v1 I missed actual uuid check skip v3: rebase to overlayfs-next, replace uuid with null in file handles, split ovl_fs propagation to function arguments to separate patch, add separate bool "uuid=on/off" option, move numfs check up, add doc note. v4: get rid of double negatives, remove nouuid leftower comment, fix misprint in kernel config name v5: fix typos; remove config option, module param, ovl_uuid_def and the corresponding notes Signed-off-by: Pavel Tikhomirov --- Documentation/filesystems/overlayfs.rst | 5 + fs/overlayfs/copy_up.c | 3 ++- fs/overlayfs/namei.c| 4 +++- fs/overlayfs/ovl_entry.h| 1 + fs/overlayfs/super.c| 20 5 files changed, 31 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 580ab9a0fe31..a3e588dc8437 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -563,6 +563,11 @@ This verification may cause significant overhead in some cases. Note: the mount options index=off,nfs_export=on are conflicting for a read-write mount and will result in an error. +Note: the mount option uuid=off can be used to replace UUID of the underlying +filesystem in file handles with null, and effectively disable UUID checks. This +can be useful in case the underlying disk is copied and the UUID of this copy +is changed. This is only applicable if all lower/upper/work directories are on +the same filesystem, otherwise it will fallback to normal behaviour. Volatile mount -- diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c index 3380039036d6..0b7e7a90a435 100644 --- a/fs/overlayfs/copy_up.c +++ b/fs/overlayfs/copy_up.c @@ -320,7 +320,8 @@ struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, struct dentry *real, if (is_upper) fh->fb.flags |= OVL_FH_FLAG_PATH_UPPER; fh->fb.len = sizeof(fh->fb) + buflen; - fh->fb.uuid = *uuid; + if (ofs->config.uuid) + fh->fb.uuid = *uuid; return fh; diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c index f058bf8e8b87..f731eb4d35f9 100644 --- a/fs/overlayfs/namei.c +++ b/fs/overlayfs/namei.c @@ -159,8 +159,10 @@ struct dentry *ovl_decode_real_fh(struct ovl_fs *ofs, struct ovl_fh *fh, /* * Make sure that the stored uuid matches the uuid of the lower * layer where file handle will be decoded. +* In case of uuid=off option just make sure that stored uuid is null. */ - if (!uuid_equal(&fh->fb.uuid, &mnt->mnt_sb->s_uuid)) + if (ofs->config.uuid ? !uuid_equal(&fh->fb.uuid, &mnt->mnt_sb->s_uuid) : + !uuid_is_null(&fh->fb.uuid)) return NULL; bytes = (fh->fb.len - offsetof(struct ovl_fb, fid)); diff --git a/fs/overlay
[PATCH v5 1/2] ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh
This will be used in next patch to be able to change uuid checks and add uuid nullification based on ofs->config.index for a new "uuid=off" mode. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Reviewed-by: Amir Goldstein Signed-off-by: Pavel Tikhomirov --- fs/overlayfs/copy_up.c | 22 -- fs/overlayfs/export.c| 10 ++ fs/overlayfs/namei.c | 19 ++- fs/overlayfs/overlayfs.h | 14 -- fs/overlayfs/util.c | 3 ++- 5 files changed, 38 insertions(+), 30 deletions(-) diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c index 955ecd4030f0..3380039036d6 100644 --- a/fs/overlayfs/copy_up.c +++ b/fs/overlayfs/copy_up.c @@ -275,7 +275,8 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat) return err; } -struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper) +struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, struct dentry *real, + bool is_upper) { struct ovl_fh *fh; int fh_type, dwords; @@ -328,8 +329,8 @@ struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper) return ERR_PTR(err); } -int ovl_set_origin(struct dentry *dentry, struct dentry *lower, - struct dentry *upper) +int ovl_set_origin(struct ovl_fs *ofs, struct dentry *dentry, + struct dentry *lower, struct dentry *upper) { const struct ovl_fh *fh = NULL; int err; @@ -340,7 +341,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower, * up and a pure upper inode. */ if (ovl_can_decode_fh(lower->d_sb)) { - fh = ovl_encode_real_fh(lower, false); + fh = ovl_encode_real_fh(ofs, lower, false); if (IS_ERR(fh)) return PTR_ERR(fh); } @@ -362,7 +363,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper, const struct ovl_fh *fh; int err; - fh = ovl_encode_real_fh(upper, true); + fh = ovl_encode_real_fh(ofs, upper, true); if (IS_ERR(fh)) return PTR_ERR(fh); @@ -380,6 +381,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper, static int ovl_create_index(struct dentry *dentry, struct dentry *origin, struct dentry *upper) { + struct ovl_fs *ofs = OVL_FS(dentry->d_sb); struct dentry *indexdir = ovl_indexdir(dentry->d_sb); struct inode *dir = d_inode(indexdir); struct dentry *index = NULL; @@ -402,7 +404,7 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin, if (WARN_ON(ovl_test_flag(OVL_INDEX, d_inode(dentry return -EIO; - err = ovl_get_index_name(origin, &name); + err = ovl_get_index_name(ofs, origin, &name); if (err) return err; @@ -411,7 +413,7 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin, if (IS_ERR(temp)) goto free_name; - err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp); + err = ovl_set_upper_fh(ofs, upper, temp); if (err) goto out; @@ -521,7 +523,7 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp) * hard link. */ if (c->origin) { - err = ovl_set_origin(c->dentry, c->lowerpath.dentry, temp); + err = ovl_set_origin(ofs, c->dentry, c->lowerpath.dentry, temp); if (err) return err; } @@ -700,7 +702,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c) static int ovl_do_copy_up(struct ovl_copy_up_ctx *c) { int err; - struct ovl_fs *ofs = c->dentry->d_sb->s_fs_info; + struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb); bool to_index = false; /* @@ -722,7 +724,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c) if (to_index) { c->destdir = ovl_indexdir(c->dentry->d_sb); - err = ovl_get_index_name(c->lowerpath.dentry, &c->destname); + err = ovl_get_index_name(ofs, c->lowerpath.dentry, &c->destname); if (err) return err; } else if (WARN_ON(!c->parent)) { diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index ed35be3fafc6..41ebf52f1bbc 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -211,7 +211,8 @@ static int ovl_check_encode_origin(struct dentry *dentry) return 1; } -static int ovl_dentry_to_fid(struct dentry *dentry, u32 *fid, int buflen) +static int ovl_dentry_to_fid(struct ovl_fs *ofs, struct dentry *dentry, +u32 *fid, int buflen)
[PATCH v5 0/2] ovl introduce "uuid=off"
This is a v5 of: ovl: introduce new "index=nouuid" option for inodes index feature Changes in v3: rebase to overlayfs-next, replace uuid with null in file handles, propagate ovl_fs to needed functions in a separate patch, add separate bool "uuid=on/off" option, fix numfs check fallback, add a note to docs. Changes in v4: get rid of double negatives, remove nouuid leftower comment, fix missprint in kernel config name. Changes in v5: fix typos; remove config option and module param. Amir, as second patch had changed quiet a bit, I don't add you reviewed-by to it. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov Pavel Tikhomirov (2): ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh ovl: introduce new "uuid=off" option for inodes index feature Documentation/filesystems/overlayfs.rst | 5 + fs/overlayfs/copy_up.c | 25 ++--- fs/overlayfs/export.c | 10 ++ fs/overlayfs/namei.c| 23 +-- fs/overlayfs/overlayfs.h| 14 -- fs/overlayfs/ovl_entry.h| 1 + fs/overlayfs/super.c| 20 fs/overlayfs/util.c | 3 ++- 8 files changed, 69 insertions(+), 32 deletions(-) -- 2.26.2
Re: [PATCH v4 2/2] ovl: introduce new "uuid=off" option for inodes index feature
On 10/6/20 6:13 PM, Miklos Szeredi wrote: On Fri, Sep 25, 2020 at 10:35 AM Pavel Tikhomirov wrote: Note: In our (Virtuozzo) use case users inside a container can create "regular" overlayfs mounts without any "index=" option, but we still want to migrate this containers with CRIU so we set "index=on" as kernel default so that all the container overlayfs mounts get support of file handles automatically. With "uuid=off" we want the same thing (to be able to "copy" container with uuid change) - we would set kernel default so that all the container overlayfs mounts get "uuid=off" automatically. I'm not sure I buy that argument for a kernel option. It should rather be a "container" option in that case, but AFAIK the kernel doesn't have a concept of a container. I think this needs to be discussed on the relevant mailing lists. As of now mainline kernel doesn't support unprivileged overlay mounts, so I guess this is not an issue. Let's just merge this without the kernel and the module options. Virtuozzo kernel does have a "container" concept and we do have unprivileged overlay mounts to support docker inside Virtuozzo containers. We don't face any major issues with it. But you are right it's not mainstream. Probably a normal user of mainstream kernel also might want to set index=on+uuid=off by default, so that all their docker containters automatically support inotifies and survive backing disk uuid change automaticaly. I will prepare next patchset version without default. Thanks, Miklos -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH 0/4] fs: add mount_setattr()
mount_setattr() can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u32 propagation; __u32 atime; }; We probably can rework "mnt: allow to add a mount into an existing group" (MS_SET_GROUP https://lkml.org/lkml/2017/1/23/712) to an extension mount_setattr. Do anyone have any objections? We need it in CRIU because it is a big problem to restore complex mount trees with complex propagation flags (see my LPC talk for details https://linuxplumbersconf.org/event/7/contributions/640/) of system-containers. If we allow set(copy) sharing options we can separate mount tree restore and propagation restore and everything becomes much simpler. (And we already have CRIU implementation based on it, which helps with variety of bugs with mounts we previously had. https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#880) I've also tried to consider another approach https://github.com/Snorch/linux/commit/84886f588527b062993ec3e9760c879163852518 to disable actual propagation while restoring the mount tree. But with this approach it looks like it would be still hard to restore in CRIU because: With "MS_SET_GROUP" we don't care about roots. Imagine we have mount A (shared_id=1, root="/some/sub/path") and mount B (master_id=1, root="/some/other/sub/path", with MS_SET_GROUP we can copy sharing from mount A to B and make B MS_SLAVE to restore them. In case we want to do the same with only inheritance we would have to have some helper mount in this share C (shared_id=1, root="/") so that B can inherit this share... -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH v4 2/2] ovl: introduce new "uuid=off" option for inodes index feature
On 9/25/20 7:42 PM, Amir Goldstein wrote: Apart from some typos, looks good to me. Amir, Thanks a lot for your review! > you should wait for more feedback from others Sure, will wait. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH v3 2/2] ovl: introduce new "uuid=off" option for inodes index feature
Thanks for review! I've sent v4 to address all your comments. I've also noticed that I had inconsistent config option naming: +config OVERLAY_FS_INDEX_UUID_OFF +static bool ovl_uuid_off_def = IS_ENABLED(CONFIG_OVERLAY_FS_UUID_OFF); this is also fixed. On 9/24/20 8:02 PM, Amir Goldstein wrote: On Thu, Sep 24, 2020 at 7:38 PM Pavel Tikhomirov wrote: This replaces uuid with null in overelayfs file handles and thus relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are uniq. In other case when we have multiple filesystems lets just fallback to "uuid=on" which is and equivalent of how it worked before with all uuid checks. This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. Note: In our (Virtuozzo) use case users inside a container can create "regular" overlayfs mounts without any "index=" option, but we still want to migrate this containers with CRIU so we set "index=on" as kernel default so that all the container overlayfs mounts get support of file handles automatically. With "uuid=off" we want the same thing (to be able to "copy" container with uuid change) - we would set kernel default so that all the container overlayfs mounts get "uuid=off" automatically. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when crate the copy of container and we require fs uuid to be uniq for a new container. v2: in v1 I missed actual uuid check skip v3: rebase to overlayfs-next, replace uuid with null in file handles, split ovl_fs propagation to function arguments to separate patch, add separate bool "uuid=on/off" option, move numfs check up, add doc note. change log does not belong in commit message. Move after --- please. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- Documentation/filesystems/overlayfs.rst | 6 ++ fs/overlayfs/Kconfig| 17 + fs/overlayfs/copy_up.c | 3 ++- fs/overlayfs/namei.c| 5 - fs/overlayfs/ovl_entry.h| 1 + fs/overlayfs/super.c| 25 + 6 files changed, 55 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 580ab9a0fe31..4f9cc20f255c 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -563,6 +563,12 @@ This verification may cause significant overhead in some cases. Note: the mount options index=off,nfs_export=on are conflicting for a read-write mount and will result in an error. +Note: the mount option uuid=off (or corresponding module param, or kernel +config) can be used to replace UUID of the underlying filesystem in file +handles with null, and effectively disable UUID checks. This can be useful in +case the underlying disk is copied and the UUID of this copy is changed. This +is only applicable if all lower/upper/work directories are on the same +filesystem, otherwise it will fallback to normal behaviour. Volatile mount -- diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig index dd188c7996b3..888c6e5e71ee 100644 --- a/fs/overlayfs/Kconfig +++ b/fs/overlayfs/Kconfig @@ -61,6 +61,23 @@ config OVERLAY_FS_INDEX If unsure, say N. +config OVERLAY_FS_INDEX_UUID_OFF Please get rid of all the double negatives. config/param uuid_off should be uuid and default to Y. Otherwise this looks fine to me. + bool "Overlayfs: export null uuid in file handles&qu
[PATCH v4 2/2] ovl: introduce new "uuid=off" option for inodes index feature
This replaces uuid with null in overelayfs file handles and thus relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are uniq. In other case when we have multiple filesystems lets just fallback to "uuid=on" which is and equivalent of how it worked before with all uuid checks. This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. Note: In our (Virtuozzo) use case users inside a container can create "regular" overlayfs mounts without any "index=" option, but we still want to migrate this containers with CRIU so we set "index=on" as kernel default so that all the container overlayfs mounts get support of file handles automatically. With "uuid=off" we want the same thing (to be able to "copy" container with uuid change) - we would set kernel default so that all the container overlayfs mounts get "uuid=off" automatically. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when crate the copy of container and we require fs uuid to be uniq for a new container. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- v2: in v1 I missed actual uuid check skip v3: rebase to overlayfs-next, replace uuid with null in file handles, split ovl_fs propagation to function arguments to separate patch, add separate bool "uuid=on/off" option, move numfs check up, add doc note. v4: get rid of double negatives, remove nouuid leftower comment, fix missprint in kernel config name Documentation/filesystems/overlayfs.rst | 6 ++ fs/overlayfs/Kconfig| 19 +++ fs/overlayfs/copy_up.c | 3 ++- fs/overlayfs/namei.c| 4 +++- fs/overlayfs/ovl_entry.h| 1 + fs/overlayfs/super.c| 25 + 6 files changed, 56 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 580ab9a0fe31..4f9cc20f255c 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -563,6 +563,12 @@ This verification may cause significant overhead in some cases. Note: the mount options index=off,nfs_export=on are conflicting for a read-write mount and will result in an error. +Note: the mount option uuid=off (or corresponding module param, or kernel +config) can be used to replace UUID of the underlying filesystem in file +handles with null, and effectively disable UUID checks. This can be useful in +case the underlying disk is copied and the UUID of this copy is changed. This +is only applicable if all lower/upper/work directories are on the same +filesystem, otherwise it will fallback to normal behaviour. Volatile mount -- diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig index dd188c7996b3..c21abdb43206 100644 --- a/fs/overlayfs/Kconfig +++ b/fs/overlayfs/Kconfig @@ -61,6 +61,25 @@ config OVERLAY_FS_INDEX If unsure, say N. +config OVERLAY_FS_INDEX_UUID + bool "Overlayfs: export uuid in file handles" + default y + depends on OVERLAY_FS + help + If this config option is disabled then overlay will replace uuid with + null in overlayfs file handles, effectively disabling uuid checks for + them. This affects overlayfs mounted with "index=on". This only can be + done if all upper and lower directories are on the same filesystem + where basic fhandles are uniq. In case the latter is not true + overlayfs wou
[PATCH v4 1/2] ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh
This will be used in next patch to be able to change uuid checks and add uuid nullification based on ofs->config.index for a new "uuid=off" mode. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- fs/overlayfs/copy_up.c | 22 -- fs/overlayfs/export.c| 10 ++ fs/overlayfs/namei.c | 19 ++- fs/overlayfs/overlayfs.h | 14 -- fs/overlayfs/util.c | 3 ++- 5 files changed, 38 insertions(+), 30 deletions(-) diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c index 955ecd4030f0..3380039036d6 100644 --- a/fs/overlayfs/copy_up.c +++ b/fs/overlayfs/copy_up.c @@ -275,7 +275,8 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat) return err; } -struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper) +struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, struct dentry *real, + bool is_upper) { struct ovl_fh *fh; int fh_type, dwords; @@ -328,8 +329,8 @@ struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper) return ERR_PTR(err); } -int ovl_set_origin(struct dentry *dentry, struct dentry *lower, - struct dentry *upper) +int ovl_set_origin(struct ovl_fs *ofs, struct dentry *dentry, + struct dentry *lower, struct dentry *upper) { const struct ovl_fh *fh = NULL; int err; @@ -340,7 +341,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower, * up and a pure upper inode. */ if (ovl_can_decode_fh(lower->d_sb)) { - fh = ovl_encode_real_fh(lower, false); + fh = ovl_encode_real_fh(ofs, lower, false); if (IS_ERR(fh)) return PTR_ERR(fh); } @@ -362,7 +363,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper, const struct ovl_fh *fh; int err; - fh = ovl_encode_real_fh(upper, true); + fh = ovl_encode_real_fh(ofs, upper, true); if (IS_ERR(fh)) return PTR_ERR(fh); @@ -380,6 +381,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper, static int ovl_create_index(struct dentry *dentry, struct dentry *origin, struct dentry *upper) { + struct ovl_fs *ofs = OVL_FS(dentry->d_sb); struct dentry *indexdir = ovl_indexdir(dentry->d_sb); struct inode *dir = d_inode(indexdir); struct dentry *index = NULL; @@ -402,7 +404,7 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin, if (WARN_ON(ovl_test_flag(OVL_INDEX, d_inode(dentry return -EIO; - err = ovl_get_index_name(origin, &name); + err = ovl_get_index_name(ofs, origin, &name); if (err) return err; @@ -411,7 +413,7 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin, if (IS_ERR(temp)) goto free_name; - err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp); + err = ovl_set_upper_fh(ofs, upper, temp); if (err) goto out; @@ -521,7 +523,7 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp) * hard link. */ if (c->origin) { - err = ovl_set_origin(c->dentry, c->lowerpath.dentry, temp); + err = ovl_set_origin(ofs, c->dentry, c->lowerpath.dentry, temp); if (err) return err; } @@ -700,7 +702,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c) static int ovl_do_copy_up(struct ovl_copy_up_ctx *c) { int err; - struct ovl_fs *ofs = c->dentry->d_sb->s_fs_info; + struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb); bool to_index = false; /* @@ -722,7 +724,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c) if (to_index) { c->destdir = ovl_indexdir(c->dentry->d_sb); - err = ovl_get_index_name(c->lowerpath.dentry, &c->destname); + err = ovl_get_index_name(ofs, c->lowerpath.dentry, &c->destname); if (err) return err; } else if (WARN_ON(!c->parent)) { diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index ed35be3fafc6..41ebf52f1bbc 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -211,7 +211,8 @@ static int ovl_check_encode_origin(struct dentry *dentry) return 1; } -static int ovl_dentry_to_fid(struct dentry *dentry, u32 *fid, int buflen) +static int ovl_dentry_to_fid(struct ovl_fs *ofs, struct dentry *dentry, +u32 *fid, int buflen) { struct ovl_fh *fh = NULL;
[PATCH v4 0/2] ovl introduce "uuid=off"
This is a v4 of: ovl: introduce new "index=nouuid" option for inodes index feature Changes in v3: rebase to overlayfs-next, replace uuid with null in file handles, propagate ovl_fs to needed functions in a separate patch, add separate bool "uuid=on/off" option, fix numfs check fallback, add a note to docs. Changes in v4: get rid of double negatives, remove nouuid leftower comment, fix missprint in kernel config name. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov Pavel Tikhomirov (2): ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh ovl: introduce new "uuid=off" option for inodes index feature Documentation/filesystems/overlayfs.rst | 6 ++ fs/overlayfs/Kconfig| 19 +++ fs/overlayfs/copy_up.c | 25 ++--- fs/overlayfs/export.c | 10 ++ fs/overlayfs/namei.c| 23 +-- fs/overlayfs/overlayfs.h| 14 -- fs/overlayfs/ovl_entry.h| 1 + fs/overlayfs/super.c| 25 + fs/overlayfs/util.c | 3 ++- 9 files changed, 94 insertions(+), 32 deletions(-) -- 2.26.2
[PATCH v3 2/2] ovl: introduce new "uuid=off" option for inodes index feature
This replaces uuid with null in overelayfs file handles and thus relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are uniq. In other case when we have multiple filesystems lets just fallback to "uuid=on" which is and equivalent of how it worked before with all uuid checks. This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. Note: In our (Virtuozzo) use case users inside a container can create "regular" overlayfs mounts without any "index=" option, but we still want to migrate this containers with CRIU so we set "index=on" as kernel default so that all the container overlayfs mounts get support of file handles automatically. With "uuid=off" we want the same thing (to be able to "copy" container with uuid change) - we would set kernel default so that all the container overlayfs mounts get "uuid=off" automatically. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when crate the copy of container and we require fs uuid to be uniq for a new container. v2: in v1 I missed actual uuid check skip v3: rebase to overlayfs-next, replace uuid with null in file handles, split ovl_fs propagation to function arguments to separate patch, add separate bool "uuid=on/off" option, move numfs check up, add doc note. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- Documentation/filesystems/overlayfs.rst | 6 ++ fs/overlayfs/Kconfig| 17 + fs/overlayfs/copy_up.c | 3 ++- fs/overlayfs/namei.c| 5 - fs/overlayfs/ovl_entry.h| 1 + fs/overlayfs/super.c| 25 + 6 files changed, 55 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 580ab9a0fe31..4f9cc20f255c 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -563,6 +563,12 @@ This verification may cause significant overhead in some cases. Note: the mount options index=off,nfs_export=on are conflicting for a read-write mount and will result in an error. +Note: the mount option uuid=off (or corresponding module param, or kernel +config) can be used to replace UUID of the underlying filesystem in file +handles with null, and effectively disable UUID checks. This can be useful in +case the underlying disk is copied and the UUID of this copy is changed. This +is only applicable if all lower/upper/work directories are on the same +filesystem, otherwise it will fallback to normal behaviour. Volatile mount -- diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig index dd188c7996b3..888c6e5e71ee 100644 --- a/fs/overlayfs/Kconfig +++ b/fs/overlayfs/Kconfig @@ -61,6 +61,23 @@ config OVERLAY_FS_INDEX If unsure, say N. +config OVERLAY_FS_INDEX_UUID_OFF + bool "Overlayfs: export null uuid in file handles" + depends on OVERLAY_FS + help + If this config option is enabled then overlay will replace uuid with + null in overlayfs file handles, effectively disabling uuid checks for + them. This affects overlayfs mounted with "index=on". This only can be + done if all upper and lower directories are on the same filesystem + where basic fhandles are uniq. In case the latter is not true + overlayfs would fallback to normal uuid checking mode. + + It is needed to overcome possible change of uuid on superblock
[PATCH v3 0/2] ovl introduce "uuid=off"
This is a v3 of: ovl: introduce new "index=nouuid" option for inodes index feature Changes in v3: rebase to overlayfs-next, replace uuid with null in file handles, propagate ovl_fs to needed functions in a separate patch, add separate bool "uuid=on/off" option, fix numfs check fallback, add a note to docs. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov Pavel Tikhomirov (2): ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh ovl: introduce new "uuid=off" option for inodes index feature Documentation/filesystems/overlayfs.rst | 6 ++ fs/overlayfs/Kconfig| 17 + fs/overlayfs/copy_up.c | 25 ++--- fs/overlayfs/export.c | 10 ++ fs/overlayfs/namei.c| 24 ++-- fs/overlayfs/overlayfs.h| 14 -- fs/overlayfs/ovl_entry.h| 1 + fs/overlayfs/super.c| 25 + fs/overlayfs/util.c | 3 ++- 9 files changed, 93 insertions(+), 32 deletions(-) -- 2.26.2
[PATCH v3 1/2] ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh
This will be used in next patch to be able to change uuid checks and add uuid nullification based on ofs->config.index for a new "uuid=off" mode. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- fs/overlayfs/copy_up.c | 22 -- fs/overlayfs/export.c| 10 ++ fs/overlayfs/namei.c | 19 ++- fs/overlayfs/overlayfs.h | 14 -- fs/overlayfs/util.c | 3 ++- 5 files changed, 38 insertions(+), 30 deletions(-) diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c index 955ecd4030f0..3380039036d6 100644 --- a/fs/overlayfs/copy_up.c +++ b/fs/overlayfs/copy_up.c @@ -275,7 +275,8 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat) return err; } -struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper) +struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, struct dentry *real, + bool is_upper) { struct ovl_fh *fh; int fh_type, dwords; @@ -328,8 +329,8 @@ struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper) return ERR_PTR(err); } -int ovl_set_origin(struct dentry *dentry, struct dentry *lower, - struct dentry *upper) +int ovl_set_origin(struct ovl_fs *ofs, struct dentry *dentry, + struct dentry *lower, struct dentry *upper) { const struct ovl_fh *fh = NULL; int err; @@ -340,7 +341,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower, * up and a pure upper inode. */ if (ovl_can_decode_fh(lower->d_sb)) { - fh = ovl_encode_real_fh(lower, false); + fh = ovl_encode_real_fh(ofs, lower, false); if (IS_ERR(fh)) return PTR_ERR(fh); } @@ -362,7 +363,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper, const struct ovl_fh *fh; int err; - fh = ovl_encode_real_fh(upper, true); + fh = ovl_encode_real_fh(ofs, upper, true); if (IS_ERR(fh)) return PTR_ERR(fh); @@ -380,6 +381,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct dentry *upper, static int ovl_create_index(struct dentry *dentry, struct dentry *origin, struct dentry *upper) { + struct ovl_fs *ofs = OVL_FS(dentry->d_sb); struct dentry *indexdir = ovl_indexdir(dentry->d_sb); struct inode *dir = d_inode(indexdir); struct dentry *index = NULL; @@ -402,7 +404,7 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin, if (WARN_ON(ovl_test_flag(OVL_INDEX, d_inode(dentry return -EIO; - err = ovl_get_index_name(origin, &name); + err = ovl_get_index_name(ofs, origin, &name); if (err) return err; @@ -411,7 +413,7 @@ static int ovl_create_index(struct dentry *dentry, struct dentry *origin, if (IS_ERR(temp)) goto free_name; - err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp); + err = ovl_set_upper_fh(ofs, upper, temp); if (err) goto out; @@ -521,7 +523,7 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp) * hard link. */ if (c->origin) { - err = ovl_set_origin(c->dentry, c->lowerpath.dentry, temp); + err = ovl_set_origin(ofs, c->dentry, c->lowerpath.dentry, temp); if (err) return err; } @@ -700,7 +702,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c) static int ovl_do_copy_up(struct ovl_copy_up_ctx *c) { int err; - struct ovl_fs *ofs = c->dentry->d_sb->s_fs_info; + struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb); bool to_index = false; /* @@ -722,7 +724,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c) if (to_index) { c->destdir = ovl_indexdir(c->dentry->d_sb); - err = ovl_get_index_name(c->lowerpath.dentry, &c->destname); + err = ovl_get_index_name(ofs, c->lowerpath.dentry, &c->destname); if (err) return err; } else if (WARN_ON(!c->parent)) { diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index ed35be3fafc6..41ebf52f1bbc 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -211,7 +211,8 @@ static int ovl_check_encode_origin(struct dentry *dentry) return 1; } -static int ovl_dentry_to_fid(struct dentry *dentry, u32 *fid, int buflen) +static int ovl_dentry_to_fid(struct ovl_fs *ofs, struct dentry *dentry, +u32 *fid, int buflen) { struct ovl_fh *fh = NULL;
Re: [PATCH v2] ovl: introduce new "index=nouuid" option for inodes index feature
On 9/23/20 7:09 PM, Amir Goldstein wrote: On Wed, Sep 23, 2020 at 6:23 PM Pavel Tikhomirov wrote: This relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are uniq. In case we have multiple filesystems here just fall back to normal "index=on". This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. mount -t overlay overlay -oindex=nouuid,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when crate the copy of container and we require fs uuid to be uniq for a new container. v2: in v1 I missed actual uuid check skip - add it CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- Look reasonable, but you need to rebase over git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs-next Which makes a lot of your work unneeded. ofs is propagated to most of the relevant helpers and you should propagate it down to ovl_decode_real_fh(). Thanks! Will do. Some minor comments below... fs/overlayfs/Kconfig | 16 +++ fs/overlayfs/export.c| 6 ++-- fs/overlayfs/namei.c | 35 +++ fs/overlayfs/overlayfs.h | 23 +++ fs/overlayfs/ovl_entry.h | 2 +- fs/overlayfs/super.c | 61 +--- 6 files changed, 106 insertions(+), 37 deletions(-) diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig index dd188c7996b3..b00fd44006f9 100644 --- a/fs/overlayfs/Kconfig +++ b/fs/overlayfs/Kconfig @@ -61,6 +61,22 @@ config OVERLAY_FS_INDEX If unsure, say N. +config OVERLAY_FS_INDEX_NOUUID + bool "Overlayfs: relax uuid checks of inodes index feature" + depends on OVERLAY_FS + depends on OVERLAY_FS_INDEX + help + If this config option is enabled then overlay will skip uuid checks + for index lower to upper inode map, this only can be done if all + upper and lower directories are on the same filesystem where basic + fhandles are uniq. + + It is needed to overcome possible change of uuid on superblock of the + backing filesystem, e.g. when you copied the virtual disk and mount + both the copy of the disk and the original one at the same time. + + If unsure, say N. + Please do not add a config option for this. Isn't a mount option sufficient for your needs? Users inside Virtuozzo container can mount overlayfs inside the CT (we assume that they do "regular" mounts without any "index=" option as docker does) so we wan't to setup the default in kernel config, so that all "regular" mounts of the user become "index=nouuid" automatically, and thus we would be able to both migrate (CRIU inotify resolution by fhandle on dump) and copy (copy disk with uuid change) the container without problem. config OVERLAY_FS_NFS_EXPORT bool "Overlayfs: turn on NFS export feature by default" depends on OVERLAY_FS diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 0e696f72cf65..d53feb7547d9 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -676,11 +676,12 @@ static struct dentry *ovl_upper_fh_to_d(struct super_block *sb, struct ovl_fs *ofs = sb->s_fs_info; struct dentry *dentry; struct dentry *upper; + bool nouuid = ofs->config.index == OVL_INDEX_NOUUID; if (!ovl_upper_mnt(ofs)) return ERR_PTR(-EACCES); - upper = ovl_decode_real_fh(f
Re: [PATCH v2] ovl: introduce new "index=nouuid" option for inodes index feature
On 9/23/20 7:36 PM, Amir Goldstein wrote: @@ -414,7 +415,7 @@ static int ovl_check_origin(struct ovl_fs *ofs, struct dentry *upperdentry, * Return 0 on match, -ESTALE on mismatch, < 0 on error. */ static int ovl_verify_fh(struct dentry *dentry, const char *name, -const struct ovl_fh *fh) +const struct ovl_fh *fh, bool nouuid) { struct ovl_fh *ofh = ovl_get_fh(dentry, name); int err = 0; @@ -425,8 +426,14 @@ static int ovl_verify_fh(struct dentry *dentry, const char *name, if (IS_ERR(ofh)) return PTR_ERR(ofh); - if (fh->fb.len != ofh->fb.len || memcmp(&fh->fb, &ofh->fb, fh->fb.len)) + if (fh->fb.len != ofh->fb.len) { err = -ESTALE; + } else { + if (nouuid && !uuid_equal(&fh->fb.uuid, &ofh->fb.uuid)) + ofh->fb.uuid = fh->fb.uuid; + if (memcmp(&fh->fb, &ofh->fb, fh->fb.len)) + err = -ESTALE; + } On second thought I am wondering if we should do that differently. If users want to work with index=nouuid, they need to work with it from day 1. index=nouuid should export null uuid in NFS handles and write null uuid in trusted.overlay.origin xattr. So in ovl_encode_real_fh() you set null uuid and instead of relaxing uuid_equal() in ovl_decode_real_fh() you change it to uuid_is_null(). Do you have a problem with that for Virtuozzo use case? Actually we've enabled index=on by default in kernel config in Virtuozzo only in the new update which is not yet released. So probably we can switch to index=nouuid with null uuid in fh. Thanks, Amir. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
[PATCH v2] ovl: introduce new "index=nouuid" option for inodes index feature
This relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are uniq. In case we have multiple filesystems here just fall back to normal "index=on". This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. mount -t overlay overlay -oindex=nouuid,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when crate the copy of container and we require fs uuid to be uniq for a new container. v2: in v1 I missed actual uuid check skip - add it CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- fs/overlayfs/Kconfig | 16 +++ fs/overlayfs/export.c| 6 ++-- fs/overlayfs/namei.c | 35 +++ fs/overlayfs/overlayfs.h | 23 +++ fs/overlayfs/ovl_entry.h | 2 +- fs/overlayfs/super.c | 61 +--- 6 files changed, 106 insertions(+), 37 deletions(-) diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig index dd188c7996b3..b00fd44006f9 100644 --- a/fs/overlayfs/Kconfig +++ b/fs/overlayfs/Kconfig @@ -61,6 +61,22 @@ config OVERLAY_FS_INDEX If unsure, say N. +config OVERLAY_FS_INDEX_NOUUID + bool "Overlayfs: relax uuid checks of inodes index feature" + depends on OVERLAY_FS + depends on OVERLAY_FS_INDEX + help + If this config option is enabled then overlay will skip uuid checks + for index lower to upper inode map, this only can be done if all + upper and lower directories are on the same filesystem where basic + fhandles are uniq. + + It is needed to overcome possible change of uuid on superblock of the + backing filesystem, e.g. when you copied the virtual disk and mount + both the copy of the disk and the original one at the same time. + + If unsure, say N. + config OVERLAY_FS_NFS_EXPORT bool "Overlayfs: turn on NFS export feature by default" depends on OVERLAY_FS diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 0e696f72cf65..d53feb7547d9 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -676,11 +676,12 @@ static struct dentry *ovl_upper_fh_to_d(struct super_block *sb, struct ovl_fs *ofs = sb->s_fs_info; struct dentry *dentry; struct dentry *upper; + bool nouuid = ofs->config.index == OVL_INDEX_NOUUID; if (!ovl_upper_mnt(ofs)) return ERR_PTR(-EACCES); - upper = ovl_decode_real_fh(fh, ovl_upper_mnt(ofs), true); + upper = ovl_decode_real_fh(fh, ovl_upper_mnt(ofs), true, nouuid); if (IS_ERR_OR_NULL(upper)) return upper; @@ -700,6 +701,7 @@ static struct dentry *ovl_lower_fh_to_d(struct super_block *sb, struct dentry *index = NULL; struct inode *inode; int err; + bool nouuid = ofs->config.index == OVL_INDEX_NOUUID; /* First lookup overlay inode in inode cache by origin fh */ err = ovl_check_origin_fh(ofs, fh, false, NULL, &stack); @@ -752,7 +754,7 @@ static struct dentry *ovl_lower_fh_to_d(struct super_block *sb, goto out_err; } if (index) { - err = ovl_verify_origin(index, origin.dentry, false); + err = ovl_verify_origin(index, origin.dentry, false, nouuid); if (err) goto out_err; } diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c index f7d4358db637..676170d719f8 100644 --- a/fs/overlayfs/namei.c ++
[PATCH] ovl: introduce new "index=nouuid" option for inodes index feature
This relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are uniq. In case we have multiple filesystems here just fall back to normal "index=on". This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 /root/loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when crate the copy of container and we require fs uuid to be uniq for a new container. CC: Amir Goldstein CC: Vivek Goyal CC: Miklos Szeredi CC: linux-unio...@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- fs/overlayfs/Kconfig | 16 fs/overlayfs/ovl_entry.h | 2 +- fs/overlayfs/super.c | 56 ++-- 3 files changed, 59 insertions(+), 15 deletions(-) diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig index dd188c7996b3..b00fd44006f9 100644 --- a/fs/overlayfs/Kconfig +++ b/fs/overlayfs/Kconfig @@ -61,6 +61,22 @@ config OVERLAY_FS_INDEX If unsure, say N. +config OVERLAY_FS_INDEX_NOUUID + bool "Overlayfs: relax uuid checks of inodes index feature" + depends on OVERLAY_FS + depends on OVERLAY_FS_INDEX + help + If this config option is enabled then overlay will skip uuid checks + for index lower to upper inode map, this only can be done if all + upper and lower directories are on the same filesystem where basic + fhandles are uniq. + + It is needed to overcome possible change of uuid on superblock of the + backing filesystem, e.g. when you copied the virtual disk and mount + both the copy of the disk and the original one at the same time. + + If unsure, say N. + config OVERLAY_FS_NFS_EXPORT bool "Overlayfs: turn on NFS export feature by default" depends on OVERLAY_FS diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h index b429c80879ee..2fd2cc515ad2 100644 --- a/fs/overlayfs/ovl_entry.h +++ b/fs/overlayfs/ovl_entry.h @@ -13,7 +13,7 @@ struct ovl_config { bool redirect_dir; bool redirect_follow; const char *redirect_mode; - bool index; + int index; bool nfs_export; int xino; bool metacopy; diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index 4b38141c2985..617a5083e659 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -38,10 +38,16 @@ module_param_named(redirect_always_follow, ovl_redirect_always_follow, MODULE_PARM_DESC(redirect_always_follow, "Follow redirects even if redirect_dir feature is turned off"); -static bool ovl_index_def = IS_ENABLED(CONFIG_OVERLAY_FS_INDEX); -module_param_named(index, ovl_index_def, bool, 0644); +#define OVL_INDEX_OFF 0 +#define OVL_INDEX_ON 1 +#define OVL_INDEX_NOUUID 2 + +static int ovl_index_def = IS_ENABLED(CONFIG_OVERLAY_FS_INDEX_NOUUID) ? + OVL_INDEX_NOUUID : + IS_ENABLED(CONFIG_OVERLAY_FS_INDEX); +module_param_named(index, ovl_index_def, int, 0644); MODULE_PARM_DESC(index, -"Default to on or off for the inodes index feature"); +"Default to on, off or nouuid for the inodes index feature"); static bool ovl_nfs_export_def = IS_ENABLED(CONFIG_OVERLAY_FS_NFS_EXPORT); module_param_named(nfs_export, ovl_nfs_export_def, bool, 0644); @@ -352,8 +358,18 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry) seq_puts(m, ",default_permissions"); if (strcmp(ofs->config.redirect_mode, ovl_redirect_mode_def()) != 0) seq_printf(m, ",redirect_dir=%s", ofs-&g
Re: [PATCH] ovl: introduce new "index=nouuid" option for inodes index feature
Please drop, I accidentally missed several hunks... On 9/23/20 3:50 PM, Pavel Tikhomirov wrote: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 /root/loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when crate the copy of container and we require fs uuid to be uniq for a new container. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary
On 8/4/20 8:43 AM, Andrei Vagin wrote: On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote: On 30.07.2020 17:34, Eric W. Biederman wrote: Kirill Tkhai writes: Currently, there is no a way to list or iterate all or subset of namespaces in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, but some also may be as open files, which are not attached to a process. When a namespace open fd is sent over unix socket and then closed, it is impossible to know whether the namespace exists or not. Also, even if namespace is exposed as attached to a process or as open file, iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because this multiplies at tasks and fds number. Could you describe with more details when you need to iterate namespaces? There are three ways to hold namespaces. * processes * bind-mounts * file descriptors When CRIU dumps a container, it enumirates all processes, collects file descriptors and mounts. This means that we will be able to collect all namespaces, doesn't it? Yes we can. But it would be much easier for us to have all namespaces in one place isn't it? And this patch-set has another non-CRIU use case. It can simplify a view to namespaces for a normal user. Lets consider some cases: Lets assume we have an empty (no processes) mount namespace M which is held by single open fd, which was put in a unix socket and closed, unix socket has single open fd to it which was in it's turn put to another unix socket and again and again until we reach unix socket max depth... How should normal user find this mount namespace M? Lets assume that M also has a nsfs bindmount which helds some empty network namespace N... How should normal user find N? Lets also assume that M has overmounted "/": mount -t tmpfs tmpfs / Now if you would enter M you would see single tmpfs (because of implicit chroot to overmount on setns) in mountinfo and there is no way to see full mountinfo if you does not know real root dentry... How should normal user (or even CRIU) find N? So my personal opinion is that we need this interface, maybe it should be done somehow different but we need it. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary
On 7/31/20 1:13 AM, Eric W. Biederman wrote: Kirill Tkhai writes: On 30.07.2020 17:34, Eric W. Biederman wrote: Kirill Tkhai writes: Currently, there is no a way to list or iterate all or subset of namespaces in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, but some also may be as open files, which are not attached to a process. When a namespace open fd is sent over unix socket and then closed, it is impossible to know whether the namespace exists or not. Also, even if namespace is exposed as attached to a process or as open file, iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because this multiplies at tasks and fds number. I am very dubious about this. I have been avoiding exactly this kind of interface because it can create rather fundamental problems with checkpoint restart. restart/restore :) You do have some filtering and the filtering is not based on current. Which is good. A view that is relative to a user namespace might be ok.It almost certainly does better as it's own little filesystem than as an extension to proc though. The big thing we want to ensure is that if you migrate you can restore everything. I don't see how you will be able to restore these files after migration. Anything like this without having a complete checkpoint/restore story is a non-starter. There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/. CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files. As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any problem here. An obvious diffference is that you are adding the inode to the inode to the file name. Which means that now you really do have to preserve the inode numbers during process migration. Which means now we have to do all of the work to make inode number restoration possible. Which means now we need to have multiple instances of nsfs so that we can restore inode numbers. I think this is still possible but we have been delaying figuring out how to restore inode numbers long enough that may be actual technical problems making it happen. Now maybe CRIU can handle the names of the files changing during migration but you have just increased the level of difficulty for doing that. Yes adding /proc/namespaces/:[] files may be a problem to CRIU. First I would like to highlight that open files are not a problem. Because open file from /proc/namespaces/* are exactly the same as open files from /proc//ns/. So when we c/r an nsfs open file fd on dump we readlink the fd and get :[] and on restore we recreate each dumped namespace and open an fd to each, so we can 'dup' it when restoring open file. It will be an fd to topologically same namespace though ns_ino would be newly generated. But the problem I see is with readdir. What if some task is reading /proc/namespaces/ directory at the time of dump, after restore directory will contain new names for namespaces and possibly in different order, this way if process continues to readdir it can miss some namespaces or read some twice. May be instead of multiple files in /proc/namespaces directory, we can leave just one file /proc/namespaces and when we open it we would return e.g. a unix socket filled with all the fds of all namespacess visible at this point. It looks like a possible solution to the above problem. CRIU can restore unix sockets with fds inside, so we should be able to dump process using this functionality. If you have a specific worries about, let's discuss them. I was asking and I am asking that it be described in the patch description how a container using this feature can be migrated from one machine to another. This code is so close to being problematic that we need be very careful we don't fundamentally break CRIU while trying to make it's job simpler and easier. CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces C/R. Further by not going through the processes it looks like you are bypassing the existing permission checks. Which has the potential to allow someone to use a namespace who would not be able to otherwise. I agree, and I wrote to Christian, that permissions should be more strict. This just should be formalized. Let's discuss this. So I think this goes one step too far but I am willing to be persuaded otherwise. Eric -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH 0/2] overlayfs: C/R enhancements
On 6/5/20 5:35 AM, Amir Goldstein wrote: On Fri, Jun 5, 2020 at 12:34 AM Alexander Mikhalitsyn wrote: Hello, But overlayfs won't accept these "output only" options as input args, which is a problem. Will it be problematic if we simply ignore "lowerdir_mnt_id" and "upperdir_mnt_id" options in ovl_parse_opt()? That would solve this small problem. This is not a big problem actually as these options shown in mountinfo for overlay had been "output only" forever, please see these two examples below: a) Imagine you've mounted overlay with relative paths and forgot (or you never known as you are another user) where your cwd was at the moment of mount syscall. - How would you use those options as "input" to create the same overlay mount somethere else (bind-mounting not involved)? b) Imagine you've mounted overlay with absolute paths and someone (other user) overmounted lower (upper/workdir) paths for you, all directory structure would be the same on overmount but yet files are different. - How would you use those options from mountinfo as "input" ones? We try to make them much closer to "input" ones. Agreed, we should ignore *_mnt_id on mount because paths identify mounts at the time of mount call. Wouldn't it be better for C/R to implement mount options that overlayfs can parse and pass it mntid and fhandle instead of paths? >> Problem is that we need to know on C/R "dump stage" which mounts are used on lower layers and upper layer. Most likely I don't understand something but I can't catch how "mount-time" options will help us. As you already know from inotify/fanotify C/R fhandle is timeless, so there would be no distinction between mount time and dump time. Pair of fhandle+mnt_id looks an equivalent to path+mnt_id pair, CRIU will just need to open fhandle+mnt_id with open_by_handle_at and readlink to get path on dump and continue to use path+mnt_id as before. (not too common with fhandles but it's my current understanding) But if you take a look on (a) and (b) again, the regular user does not see full information about overlay mount in /proc/pid/mountinfo, they can't just take a look on it and understand from there it comes from. Resolving fhandle looks like a too hard task for a user. About mnt_id, your patches will cause the original mount-time mounts to be busy. That is a problem as well. Children mounts lock parent, open files lock parent. Another analogy is a loop device which locks the backing file mount (AFAICS). Anyway one can lazy umount, can't they? But I'm not too sure for this one, maybe you can share more implications of this problem? I think you should describe the use case is more details. Is your goal to C/R any overlayfs mount that the process has open files on? visible to process We wan't to dump a container, not a simple process, if the container process has access to some resource CRIU needs to restore this resource. Imagine the process in container mounts it's own overlay inside container, for instance to imulate write access to readonly mount or just to implement some snapshots, don't know exact use case. And we want to checkpoint/restore this container. (Currently CRIU only supports overlay as external mount, e.g. for docker continers docker engine pre-creates overlay for us and we just bind from it - it's a different case.) If the in-container process creates the in-container mount we need to recreate it on restore so that the in-container view of the filesystem persists. For NFS export, we use the persistent descriptor {uuid;fhandle} (a.k.a. struct ovl_fh) to encode an underlying layer object. CRIU can look for an existing mount to a filesystem with uuid as restore stage (or even mount this filesystem) and use open_by_handle_at() to open a path to layer. On restore we can be on another physical node, so I doubt we have same uuid's, sorry I don't fully understand here already. After mounting overlay, that mount to underlying fs can even be discarded. And if this works for you, you don't have to export the layers ovl_fh in /proc/mounts, you can export them in numerous other ways. One way from the top of my head, getxattr on overlay root dir. "trusted.overlay" xattr is anyway a reserved prefix, so "trusted.overlay.layers" for example could work. Thanks xattr might be a good option, but still don't forget about (a) and (b), users like to know all information about mount from /proc/pid/mountinfo. Thanks, Amir. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] sunrpc: fix crash when cache_head become valid before update
Add Neil to CC, sorry, had lost it somehow... On 10/1/19 11:03 AM, Pavel Tikhomirov wrote: > I was investigating a crash in our Virtuozzo7 kernel which happened in > in svcauth_unix_set_client. I found out that we access m_client field > in ip_map structure, which was received from sunrpc_cache_lookup (we > have a bit older kernel, now the code is in sunrpc_cache_add_entry), and > these field looks uninitialized (m_client == 0x74 don't look like a > pointer) but in the cache_head in flags we see 0x1 which is CACHE_VALID. > > It looks like the problem appeared from our previous fix to sunrpc (1): > commit 4ecd55ea0742 ("sunrpc: fix cache_head leak due to queued > request") > > And we've also found a patch already fixing our patch (2): > commit d58431eacb22 ("sunrpc: don't mark uninitialised items as VALID.") > > Though the crash is eliminated, I think the core of the problem is not > completely fixed: > > Neil in the patch (2) makes cache_head CACHE_NEGATIVE, before > cache_fresh_locked which was added in (1) to fix crash. These way > cache_is_valid won't say the cache is valid anymore and in > svcauth_unix_set_client the function cache_check will return error > instead of 0, and we don't count entry as initialized. > > But it looks like we need to remove cache_fresh_locked completely in > sunrpc_cache_lookup: > > In (1) we've only wanted to make cache_fresh_unlocked->cache_dequeue so > that cache_requests with no readers also release corresponding > cache_head, to fix their leak. We with Vasily were not sure if > cache_fresh_locked and cache_fresh_unlocked should be used in pair or > not, so we've guessed to use them in pair. > > Now we see that we don't want the CACHE_VALID bit set here by > cache_fresh_locked, as "valid" means "initialized" and there is no > initialization in sunrpc_cache_add_entry. Both expiry_time and > last_refresh are not used in cache_fresh_unlocked code-path and also not > required for the initial fix. > > So to conclude cache_fresh_locked was called by mistake, and we can just > safely remove it instead of crutching it with CACHE_NEGATIVE. It looks > ideologically better for me. Hope I don't miss something here. > > Here is our crash backtrace: > [13108726.326291] BUG: unable to handle kernel NULL pointer dereference at > 0074 > [13108726.326365] IP: [] > svcauth_unix_set_client+0x2ab/0x520 [sunrpc] > [13108726.326448] PGD 0 > [13108726.326468] Oops: 0002 [#1] SMP > [13108726.326497] Modules linked in: nbd isofs xfs loop > kpatch_cumulative_81_0_r1(O) xt_physdev nfnetlink_queue bluetooth rfkill > ip6table_nat nf_nat_ipv6 ip_vs_wrr ip_vs_wlc ip_vs_sh nf_conntrack_netlink > ip_vs_sed ip_vs_pe_sip nf_conntrack_sip ip_vs_nq ip_vs_lc ip_vs_lblcr > ip_vs_lblc ip_vs_ftp ip_vs_dh nf_nat_ftp nf_conntrack_ftp iptable_raw > xt_recent nf_log_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common xt_LOG xt_limit > xt_TCPMSS xt_tcpmss vxlan ip6_udp_tunnel udp_tunnel xt_statistic xt_NFLOG > nfnetlink_log dummy xt_mark xt_REDIRECT nf_nat_redirect raw_diag udp_diag > tcp_diag inet_diag netlink_diag af_packet_diag unix_diag rpcsec_gss_krb5 > xt_addrtype ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT > nf_reject_ipv6 ebtable_nat ebtable_broute nf_conntrack_ipv6 nf_defrag_ipv6 > ip6table_mangle ip6table_raw nfsv4 > [13108726.327173] dns_resolver cls_u32 binfmt_misc arptable_filter > arp_tables ip6table_filter ip6_tables devlink fuse_kio_pcs ipt_MASQUERADE > nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 xt_comment > nf_conntrack_ipv4 nf_defrag_ipv4 xt_wdog_tmo xt_multiport bonding xt_set > xt_conntrack iptable_filter iptable_mangle kpatch(O) ebtable_filter ebt_among > ebtables ip_set_hash_ip ip_set nfnetlink vfat fat skx_edac intel_powerclamp > coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass fuse pcspkr ses > enclosure joydev sg mei_me hpwdt hpilo lpc_ich mei ipmi_si shpchp > ipmi_devintf ipmi_msghandler xt_ipvs acpi_power_meter ip_vs_rr nfsv3 nfsd > auth_rpcgss nfs_acl nfs lockd grace fscache nf_nat cls_fw sch_htb sch_cbq > sch_sfq ip_vs em_u32 nf_conntrack tun br_netfilter veth overlay ip6_vzprivnet > ip6_vznetstat ip_vznetstat > [13108726.327817] ip_vzprivnet vziolimit vzevent vzlist vzstat vznetstat > vznetdev vzmon vzdev bridge pio_kaio pio_nfs pio_direct pfmt_raw pfmt_ploop1 > ploop ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 > i2c_algo_bit drm_kms_helper scsi_transport_iscsi 8021q syscopyarea > sysfillrect garp sysimgblt fb_sys_fops mrp stp ttm llc bnx2x crct10dif_pclmul > crct10dif_common crc32_pclmul crc32c_intel drm dm_multipath > ghash_clmulni_intel uas aesni_inte
[PATCH] sunrpc: fix crash when cache_head become valid before update
a0a6a41b1160 ti: a0c2a74bc000 task.ti: a0c2a74bc000 [13108726.328610] RIP: 0010:[] [] svcauth_unix_set_client+0x2ab/0x520 [sunrpc] [13108726.328706] RSP: 0018:a0c2a74bfd80 EFLAGS: 00010246 [13108726.328750] RAX: 0001 RBX: a0a6183ae000 RCX: [13108726.328811] RDX: 0074 RSI: 0286 RDI: a0c2a74bfcf0 [13108726.328864] RBP: a0c2a74bfe00 R08: a0bab8c22960 R09: 0001 [13108726.328916] R10: 0001 R11: 0001 R12: a0a32aa7f000 [13108726.328969] R13: a0a6183afac0 R14: a0c233d88d00 R15: a0c2a74bfdb4 [13108726.329022] FS: () GS:a0e17f9c() knlGS: [13108726.329081] CS: 0010 DS: ES: CR0: 80050033 [13108726.332311] CR2: 0074 CR3: 0026a1b28000 CR4: 007607e0 [13108726.334606] DR0: DR1: DR2: [13108726.336754] DR3: DR6: fffe0ff0 DR7: 0400 [13108726.338908] PKRU: [13108726.341047] Call Trace: [13108726.343074] [] ? groups_alloc+0x34/0x110 [13108726.344837] [] svc_set_client+0x24/0x30 [sunrpc] [13108726.346631] [] svc_process_common+0x241/0x710 [sunrpc] [13108726.348332] [] svc_process+0x103/0x190 [sunrpc] [13108726.350016] [] nfsd+0xdf/0x150 [nfsd] [13108726.351735] [] ? nfsd_destroy+0x80/0x80 [nfsd] [13108726.353459] [] kthread+0xd1/0xe0 [13108726.355195] [] ? create_kthread+0x60/0x60 [13108726.356896] [] ret_from_fork_nospec_begin+0x7/0x21 [13108726.358577] [] ? create_kthread+0x60/0x60 [13108726.360240] Code: 4c 8b 45 98 0f 8e 2e 01 00 00 83 f8 fe 0f 84 76 fe ff ff 85 c0 0f 85 2b 01 00 00 49 8b 50 40 b8 01 00 00 00 48 89 93 d0 1a 00 00 0f c1 02 83 c0 01 83 f8 01 0f 8e 53 02 00 00 49 8b 44 24 38 [13108726.363769] RIP [] svcauth_unix_set_client+0x2ab/0x520 [sunrpc] [13108726.365530] RSP [13108726.367179] CR2: 0074 Fixes: d58431eacb22 ("sunrpc: don't mark uninitialised items as VALID.") Signed-off-by: Pavel Tikhomirov --- net/sunrpc/cache.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c index a349094f6fb7..f740cb51802a 100644 --- a/net/sunrpc/cache.c +++ b/net/sunrpc/cache.c @@ -53,9 +53,6 @@ static void cache_init(struct cache_head *h, struct cache_detail *detail) h->last_refresh = now; } -static inline int cache_is_valid(struct cache_head *h); -static void cache_fresh_locked(struct cache_head *head, time_t expiry, - struct cache_detail *detail); static void cache_fresh_unlocked(struct cache_head *head, struct cache_detail *detail); @@ -105,9 +102,6 @@ static struct cache_head *sunrpc_cache_add_entry(struct cache_detail *detail, if (cache_is_expired(detail, tmp)) { hlist_del_init_rcu(&tmp->cache_list); detail->entries --; - if (cache_is_valid(tmp) == -EAGAIN) - set_bit(CACHE_NEGATIVE, &tmp->flags); - cache_fresh_locked(tmp, 0, detail); freeme = tmp; break; } -- 2.21.0
mm/swap: possible problem introduced when replacing REQ_NOIDLE with REQ_IDLE
Hi, all. Then porting patches from mainstream I've found some strange code: > commit a2b809672ee6fcb4d5756ea815725b3dbaea654e > Author: Christoph Hellwig > Date: Tue Nov 1 07:40:09 2016 -0600 > > block: replace REQ_NOIDLE with REQ_IDLE > > Noidle should be the default for writes as seen by all the compounds > definitions in fs.h using it. In fact only direct I/O really should > be using NODILE, so turn the whole flag around to get the defaults > right, which will make our life much easier especially onces the > WRITE_* defines go away. > > This assumes all the existing "raw" users of REQ_SYNC for writes > want noidle behavior, which seems to be spot on from a quick audit. > > Signed-off-by: Christoph Hellwig > Signed-off-by: Jens Axboe > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index ccedccb28ec8..46a74209917f 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -197,11 +197,11 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, > #define WRITE REQ_OP_WRITE > > #define READ_SYNC 0 > -#define WRITE_SYNC (REQ_SYNC | REQ_NOIDLE) > -#define WRITE_ODIRECT REQ_SYNC > -#define WRITE_FLUSH(REQ_NOIDLE | REQ_PREFLUSH) > -#define WRITE_FUA (REQ_NOIDLE | REQ_FUA) > -#define WRITE_FLUSH_FUA(REQ_NOIDLE | REQ_PREFLUSH | REQ_FUA) > +#define WRITE_SYNC REQ_SYNC > +#define WRITE_ODIRECT (REQ_SYNC | REQ_IDLE) > +#define WRITE_FLUSHREQ_PREFLUSH > +#define WRITE_FUA REQ_FUA > +#define WRITE_FLUSH_FUA(REQ_PREFLUSH | REQ_FUA) > > /* > * Attribute flags. These should be or-ed together to figure out what The above commit changes the meaning of the REQ_SYNC flag, before the patch it was equal to WRITE_ODIRECT and after the patch it is equal to WRITE_SYNC. And thus I think it became treated differently (I see only one place left in wbt_should_throttle.). But in __swap_writepage() both before and after the mentioned patch we still pass a single REQ_SYNC without any REQ_IDLE/REQ_UNIDLE: > [snorch@snorch linux]$ git blame a2b809672ee6fcb4d5756ea815725b3dbaea654e^ mm/page_io.c | grep -a5 REQ_SYNC > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 319) unlock_page(page); > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 320) ret = -ENOMEM; > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 321) goto out; > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 322) } > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 323) if (wbc->sync_mode == WB_SYNC_ALL) > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 324) bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC); > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 325) else > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 326) bio_set_op_attrs(bio, REQ_OP_WRITE, 0); > f8891e5e1f93a (Christoph Lameter 2006-06-30 01:55:45 -0700 327) count_vm_event(PSWPOUT); > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 328) set_page_writeback(page); > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 329) unlock_page(page); > [snorch@snorch linux]$ git blame a2b809672ee6fcb4d5756ea815725b3dbaea654e mm/page_io.c | grep -a5 REQ_SYNC > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 319) unlock_page(page); > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 320) ret = -ENOMEM; > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 321) goto out; > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 322) } > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 323) if (wbc->sync_mode == WB_SYNC_ALL) > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 324) bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC); > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 325) else > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 326) bio_set_op_attrs(bio, REQ_OP_WRITE, 0); > f8891e5e1f93a (Christoph Lameter 2006-06-30 01:55:45 -0700 327) count_vm_event(PSWPOUT); > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 328) set_page_writeback(page); > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 329) unlock_page(page); It looks like we've changed the way how we handle swap page writes from "odirect" way to "regular" sync write way, these can be wrong. This may also affect deprecated cfq io-scheduler on older kernels. Thanks in advance for any advice on what to do with these, may be I miss something. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
[PATCH] ext4: remove unnecessary gotos in ext4_xattr_set_entry
In the "out" label we only iput old/new_ea_inode-s, in all these places these variables are always NULL so there is no point in goto to "out". Signed-off-by: Pavel Tikhomirov --- fs/ext4/xattr.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 491f9ee4040e..ac2ddd4446b3 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -1601,8 +1601,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i, next = EXT4_XATTR_NEXT(last); if ((void *)next >= s->end) { EXT4_ERROR_INODE(inode, "corrupted xattr entries"); - ret = -EFSCORRUPTED; - goto out; + return -EFSCORRUPTED; } if (!last->e_value_inum && last->e_value_size) { size_t offs = le16_to_cpu(last->e_value_offs); @@ -1620,8 +1619,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i, free += EXT4_XATTR_LEN(name_len) + old_size; if (free < EXT4_XATTR_LEN(name_len) + new_size) { - ret = -ENOSPC; - goto out; + return -ENOSPC; } /* @@ -1634,8 +1632,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i, new_size && is_block && (min_offs + old_size - new_size) < EXT4_XATTR_BLOCK_RESERVE(inode)) { - ret = -ENOSPC; - goto out; + return -ENOSPC; } } -- 2.20.1
Re: [PATCH] tracing: Fix event filters and triggers to handle negative numbers
ping, looks like the patch was lost On 8/24/18 3:48 AM, Steven Rostedt wrote: > On Thu, 23 Aug 2018 13:25:34 +0300 > Pavel Tikhomirov wrote: > >> Then tracing syscall exit event it is extremely useful to filter exit >> codes equal to some negative value, to react only to required errors. >> But negative numbers does not work: >> >> [root@snorch sys_exit_read]# echo "ret == -1" > filter >> bash: echo: write error: Invalid argument >> [root@snorch sys_exit_read]# cat filter >> ret == -1 >> ^ >> parse_error: Invalid value (did you forget quotes)? > > Thanks for the patch. I'll apply it and then start testing it! > > -- Steve > >> >> Similar thing happens when setting triggers. >> >> These is a regression in v4.17 introduced by the commit mentioned below, >> testing without these commit shows no problem with negative numbers. >> >> Fixes: 80765597bc58 ("tracing: Rewrite filter logic to be simpler and >> faster") >> Signed-off-by: Pavel Tikhomirov >> --- >> kernel/trace/trace_events_filter.c | 5 - >> 1 file changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/trace/trace_events_filter.c >> b/kernel/trace/trace_events_filter.c >> index 84a65173b1e9..2ba449292561 100644 >> --- a/kernel/trace/trace_events_filter.c >> +++ b/kernel/trace/trace_events_filter.c >> @@ -1299,7 +1299,7 @@ static int parse_pred(const char *str, void *data, >> /* go past the last quote */ >> i++; >> >> -} else if (isdigit(str[i])) { >> +} else if (isdigit(str[i]) || str[i] == '-') { >> >> /* Make sure the field is not a string */ >> if (is_string_field(field)) { >> @@ -1312,6 +1312,9 @@ static int parse_pred(const char *str, void *data, >> goto err_free; >> } >> >> +if (str[i] == '-') >> +i++; >> + >> /* We allow 0xDEADBEEF */ >> while (isalnum(str[i])) >> i++; > -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] vhost: return EINVAL if iovecs size does not match the message size
On 12/13/2018 10:55 PM, Michael S. Tsirkin wrote: > On Thu, Dec 13, 2018 at 05:53:50PM +0300, Pavel Tikhomirov wrote: >> We've failed to copy and process vhost_iotlb_msg so let userspace at >> least know about it. For instance before these patch the code below runs >> without any error: >> >> int main() >> { >>struct vhost_msg msg; >>struct iovec iov; >>int fd; >> >>fd = open("/dev/vhost-net", O_RDWR); >>if (fd == -1) { >> perror("open"); >> return 1; >>} >> >>iov.iov_base = &msg; >>iov.iov_len = sizeof(msg)-4; >> >>if (writev(fd, &iov,1) == -1) { >> perror("writev"); >> return 1; >>} >> >>return 0; >> } >> >> Signed-off-by: Pavel Tikhomirov > > Thanks for the patch! > >> --- >> drivers/vhost/vhost.c | 8 ++-- >> 1 file changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index 3a5f81a66d34..03014224ef13 100644 >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -1024,8 +1024,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, >> int type, ret; >> >> ret = copy_from_iter(&type, sizeof(type), from); >> -if (ret != sizeof(type)) >> +if (ret != sizeof(type)) { >> +ret = -EINVAL; >> goto done; >> +} >> >> switch (type) { >> case VHOST_IOTLB_MSG: > > should this be EFAULT rather? We already have "Invalid argument" returned when wrong type of vhost_msg received, I though it would be fine to return same thing if we have wrong size of vhost_msg. When we return "Bad address" because of vhost_process_iotlb_msg fail, it is because our vhost_dev has no ->iotlb so our problem is not connected to the data passed from userspace but with the state of vhost_dev. So I like EINVAL more in these two cases. > >> @@ -1044,8 +1046,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, >> >> iov_iter_advance(from, offset); >> ret = copy_from_iter(&msg, sizeof(msg), from); >> -if (ret != sizeof(msg)) >> +if (ret != sizeof(msg)) { >> +ret = -EINVAL; >> goto done; >> +} >> if (vhost_process_iotlb_msg(dev, &msg)) { >> ret = -EFAULT; >> goto done; > > This too? > >> -- >> 2.17.1 -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
[PATCH] vhost: return EINVAL if iovecs size does not match the message size
We've failed to copy and process vhost_iotlb_msg so let userspace at least know about it. For instance before these patch the code below runs without any error: int main() { struct vhost_msg msg; struct iovec iov; int fd; fd = open("/dev/vhost-net", O_RDWR); if (fd == -1) { perror("open"); return 1; } iov.iov_base = &msg; iov.iov_len = sizeof(msg)-4; if (writev(fd, &iov,1) == -1) { perror("writev"); return 1; } return 0; } Signed-off-by: Pavel Tikhomirov --- drivers/vhost/vhost.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 3a5f81a66d34..03014224ef13 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1024,8 +1024,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, int type, ret; ret = copy_from_iter(&type, sizeof(type), from); - if (ret != sizeof(type)) + if (ret != sizeof(type)) { + ret = -EINVAL; goto done; + } switch (type) { case VHOST_IOTLB_MSG: @@ -1044,8 +1046,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, iov_iter_advance(from, offset); ret = copy_from_iter(&msg, sizeof(msg), from); - if (ret != sizeof(msg)) + if (ret != sizeof(msg)) { + ret = -EINVAL; goto done; + } if (vhost_process_iotlb_msg(dev, &msg)) { ret = -EFAULT; goto done; -- 2.17.1
[PATCH] nfs: fix comment to nfs_generic_pg_test which does the opposite
Please see comment to filelayout_pg_test for reference. To: Trond Myklebust Cc: Anna Schumaker Cc: linux-...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov --- fs/nfs/pagelist.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c index 5c4568a0804b..87f3da1fd850 100644 --- a/fs/nfs/pagelist.c +++ b/fs/nfs/pagelist.c @@ -461,7 +461,7 @@ EXPORT_SYMBOL_GPL(nfs_wait_on_request); * @prev: previous request in desc, or NULL * @req: this request * - * Returns zero if @req can be coalesced into @desc, otherwise it returns + * Returns zero if @req cannot be coalesced into @desc, otherwise it returns * the size of the request. */ size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc, -- 2.17.1
[PATCH v2] mm: cleancache: fix corruption on missed inode invalidation
If all pages are deleted from the mapping by memory reclaim and also moved to the cleancache: __delete_from_page_cache (no shadow case) unaccount_page_cache_page cleancache_put_page page_cache_delete mapping->nrpages -= nr (nrpages becomes 0) We don't clean the cleancache for an inode after final file truncation (removal). truncate_inode_pages_final check (nrpages || nrexceptional) is false no truncate_inode_pages no cleancache_invalidate_inode(mapping) These way when reading the new file created with same inode we may get these trash leftover pages from cleancache and see wrong data instead of the contents of the new file. Fix it by always doing truncate_inode_pages which is already ready for nrpages == 0 && nrexceptional == 0 case and just invalidates inode. v2: add comment Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache") To: Andrew Morton Cc: Johannes Weiner Cc: Mel Gorman Cc: Jan Kara Cc: Matthew Wilcox Cc: Andi Kleen Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Vasily Averin Reviewed-by: Andrey Ryabinin Reviewed-by: Jan Kara Signed-off-by: Pavel Tikhomirov --- mm/truncate.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/truncate.c b/mm/truncate.c index 45d68e90b703..2c5285767ce5 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -517,9 +517,14 @@ void truncate_inode_pages_final(struct address_space *mapping) */ xa_lock_irq(&mapping->i_pages); xa_unlock_irq(&mapping->i_pages); - - truncate_inode_pages(mapping, 0); } + + /* +* Cleancache needs notification even if there are no pages or shadow +* entries, else we will leave leftover pages in the cleancache for +* a deleted inode. +*/ + truncate_inode_pages(mapping, 0); } EXPORT_SYMBOL(truncate_inode_pages_final); -- 2.17.1
[PATCH] mm: cleancache: fix corruption on missed inode invalidation
If all pages are deleted from the mapping by memory reclaim and also moved to the cleancache: __delete_from_page_cache (no shadow case) unaccount_page_cache_page cleancache_put_page page_cache_delete mapping->nrpages -= nr (nrpages becomes 0) We don't clean the cleancache for an inode after final file truncation (removal). truncate_inode_pages_final check (nrpages || nrexceptional) is false no truncate_inode_pages no cleancache_invalidate_inode(mapping) These way when reading the new file created with same inode we may get these trash leftover pages from cleancache and see wrong data instead of the contents of the new file. Fix it by always doing truncate_inode_pages which is already ready for nrpages == 0 && nrexceptional == 0 case and just invalidates inode. Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache") To: Andrew Morton Cc: Johannes Weiner Cc: Mel Gorman Cc: Jan Kara Cc: Matthew Wilcox Cc: Andi Kleen Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Vasily Averin Reviewed-by: Andrey Ryabinin Signed-off-by: Pavel Tikhomirov --- mm/truncate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/truncate.c b/mm/truncate.c index 45d68e90b703..4c56c19e76eb 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -517,9 +517,9 @@ void truncate_inode_pages_final(struct address_space *mapping) */ xa_lock_irq(&mapping->i_pages); xa_unlock_irq(&mapping->i_pages); - - truncate_inode_pages(mapping, 0); } + + truncate_inode_pages(mapping, 0); } EXPORT_SYMBOL(truncate_inode_pages_final); -- 2.17.1
[PATCH] tracing: Fix event filters and triggers to handle negative numbers
Then tracing syscall exit event it is extremely useful to filter exit codes equal to some negative value, to react only to required errors. But negative numbers does not work: [root@snorch sys_exit_read]# echo "ret == -1" > filter bash: echo: write error: Invalid argument [root@snorch sys_exit_read]# cat filter ret == -1 ^ parse_error: Invalid value (did you forget quotes)? Similar thing happens when setting triggers. These is a regression in v4.17 introduced by the commit mentioned below, testing without these commit shows no problem with negative numbers. Fixes: 80765597bc58 ("tracing: Rewrite filter logic to be simpler and faster") Signed-off-by: Pavel Tikhomirov --- kernel/trace/trace_events_filter.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 84a65173b1e9..2ba449292561 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -1299,7 +1299,7 @@ static int parse_pred(const char *str, void *data, /* go past the last quote */ i++; - } else if (isdigit(str[i])) { + } else if (isdigit(str[i]) || str[i] == '-') { /* Make sure the field is not a string */ if (is_string_field(field)) { @@ -1312,6 +1312,9 @@ static int parse_pred(const char *str, void *data, goto err_free; } + if (str[i] == '-') + i++; + /* We allow 0xDEADBEEF */ while (isalnum(str[i])) i++; -- 2.17.1
Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy
Great news, that it works for you! Thanks a lot! Pavel On 11/22/2017 03:49 AM, Stuart Hayes wrote: My apologies... yes, your patch also fixes my issue. I was looking at the two new places from which you were calling scsi_eh_wakeup(), and didn't notice that you moved the spinlock in scsi_device_unbusy()... moving the spinlock in scsi_device_unbusy() also should the issue I'm seeing, given that scsi_eh_scmd_add() also uses the spinlock. I tested your patch on my issue, and it did indeed fix my issue. So you can add... Tested-by: Stuart Hayes Thanks Stuart On 11/21/2017 2:09 AM, Pavel Tikhomirov wrote: My patch should also fix your issue too, please see explanation in reply to your patch. Do your testing show that it doesn't? Thanks, Pavel. On 11/21/2017 09:10 AM, Stuart Hayes wrote: Pavel, It turns out that the error handler on our systems was not getting woken up for a different reason... I submitted a patch earlier today that fixes the issue I were seeing (I CCed you on the patch). Before I got my hands on the failing system and was able to root cause it, I was pretty sure that your patch was going to fix our issue, because after I examined the code paths, I couldn't find any other reason that the error handler would not get woken up. I tried forcing the bug that your patch fixes to occur, by compiling in some mdelay()s in a key place or two in the scsi code, but it never failed for me that way. With my patch, several systems that previously failed in 10 minutes or less successfully ran for many days. Thanks, Stuart On 11/9/2017 8:54 AM, Pavel Tikhomirov wrote: Are there any issues with this patch (https://patchwork.kernel.org/patch/9938919/) that Pavel Tikhomirov submitted back in September? I am willing to help if there's anything I can do to help get it accepted. Hi, Stuart, I asked James Bottomley about the patch status offlist and it seems that the problem is - patch lacks testing and review. I would highly appreciate review from your side and anyone who wants to participate! And if you can confirm that the patch solves the problem on your environment with no side effects please add "Tested-by:" tag also. Thanks, Pavel On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote: We have a problem on several our nodes with scsi EH. Imagine such an order of execution of two threads: CPU1 scsi_eh_scmd_add CPU2 scsi_host_queue_ready /* shost->host_busy == 1 initialy */ if (shost->shost_state == SHOST_RECOVERY) /* does not get here */ return 0; lock(shost->host_lock); shost->shost_state = SHOST_RECOVERY; busy = shost->host_busy++; /* host->can_queue == 1 initialy, busy == 1 * - go to starved label */ lock(shost->host_lock) /* wait */ shost->host_failed++; /* shost->host_busy == 2, shost->host_failed == 1 */ call scsi_eh_wakeup(shost) { if (host_busy == host_failed) { /* does not get here */ wake_up_process(shost->ehandler) } } unlock(shost->host_lock) /* acquire lock */ shost->host_busy--; Finaly we do not wakeup scsi_error_handler and all other commands coming will hang as we are in never ending recovery state as there is no one left to wakeup handler. So scsi disc in these host becomes unresponsive and all bio on node hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.) Main idea of the fix is to try to do wake up every time we decrement host_busy or increment host_failed(the latter is already OK). Now the very *last* one of busy threads getting host_lock after decrementing host_busy will see all write operations on host's shost_state, host_busy and host_failed completed thanks to implied memory barriers on spin_lock/unlock, so at the time of busy==failed we will trigger wakeup in at least one thread. (Thats why putting recovery and failed checks under lock) Signed-off-by: Pavel Tikhomirov --- drivers/scsi/scsi_lib.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f6097b89d5d3..6c99221d60aa 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev) if (starget->can_queue > 0) atomic_dec(&starget->target_busy); + spin_lock_irqsave(shost->host_lock, flags); if (unlikely(scsi_host_in_recovery(shost) && - (shost->host_failed || shost->host_eh_scheduled))) { - spin_lock_irqsave(shost->host_lock, flags); + (shost->host_failed || shost->host_eh_scheduled))) scsi_eh_wakeup(shost); - spin_unlock_irqrestore(shost->host_lock, flags); - } + spin_unlock_
Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy
JFYI these patch is in Virtuozzo7 kernel from September, and we have no issues found with it until now by out testing, and initial problem does not reproduce for 2.5 months.
Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy
My patch should also fix your issue too, please see explanation in reply to your patch. Do your testing show that it doesn't? Thanks, Pavel. On 11/21/2017 09:10 AM, Stuart Hayes wrote: Pavel, It turns out that the error handler on our systems was not getting woken up for a different reason... I submitted a patch earlier today that fixes the issue I were seeing (I CCed you on the patch). Before I got my hands on the failing system and was able to root cause it, I was pretty sure that your patch was going to fix our issue, because after I examined the code paths, I couldn't find any other reason that the error handler would not get woken up. I tried forcing the bug that your patch fixes to occur, by compiling in some mdelay()s in a key place or two in the scsi code, but it never failed for me that way. With my patch, several systems that previously failed in 10 minutes or less successfully ran for many days. Thanks, Stuart On 11/9/2017 8:54 AM, Pavel Tikhomirov wrote: Are there any issues with this patch (https://patchwork.kernel.org/patch/9938919/) that Pavel Tikhomirov submitted back in September? I am willing to help if there's anything I can do to help get it accepted. Hi, Stuart, I asked James Bottomley about the patch status offlist and it seems that the problem is - patch lacks testing and review. I would highly appreciate review from your side and anyone who wants to participate! And if you can confirm that the patch solves the problem on your environment with no side effects please add "Tested-by:" tag also. Thanks, Pavel On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote: We have a problem on several our nodes with scsi EH. Imagine such an order of execution of two threads: CPU1 scsi_eh_scmd_add CPU2 scsi_host_queue_ready /* shost->host_busy == 1 initialy */ if (shost->shost_state == SHOST_RECOVERY) /* does not get here */ return 0; lock(shost->host_lock); shost->shost_state = SHOST_RECOVERY; busy = shost->host_busy++; /* host->can_queue == 1 initialy, busy == 1 * - go to starved label */ lock(shost->host_lock) /* wait */ shost->host_failed++; /* shost->host_busy == 2, shost->host_failed == 1 */ call scsi_eh_wakeup(shost) { if (host_busy == host_failed) { /* does not get here */ wake_up_process(shost->ehandler) } } unlock(shost->host_lock) /* acquire lock */ shost->host_busy--; Finaly we do not wakeup scsi_error_handler and all other commands coming will hang as we are in never ending recovery state as there is no one left to wakeup handler. So scsi disc in these host becomes unresponsive and all bio on node hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.) Main idea of the fix is to try to do wake up every time we decrement host_busy or increment host_failed(the latter is already OK). Now the very *last* one of busy threads getting host_lock after decrementing host_busy will see all write operations on host's shost_state, host_busy and host_failed completed thanks to implied memory barriers on spin_lock/unlock, so at the time of busy==failed we will trigger wakeup in at least one thread. (Thats why putting recovery and failed checks under lock) Signed-off-by: Pavel Tikhomirov --- drivers/scsi/scsi_lib.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f6097b89d5d3..6c99221d60aa 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev) if (starget->can_queue > 0) atomic_dec(&starget->target_busy); + spin_lock_irqsave(shost->host_lock, flags); if (unlikely(scsi_host_in_recovery(shost) && - (shost->host_failed || shost->host_eh_scheduled))) { - spin_lock_irqsave(shost->host_lock, flags); + (shost->host_failed || shost->host_eh_scheduled))) scsi_eh_wakeup(shost); - spin_unlock_irqrestore(shost->host_lock, flags); - } + spin_unlock_irqrestore(shost->host_lock, flags); atomic_dec(&sdev->device_busy); } @@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct request_queue *q, spin_unlock_irq(shost->host_lock); out_dec: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && + (shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + return 0; } @@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hc
Re: [PATCH] scsi_error: ensure EH wakes up on error to prevent host getting stuck
On 11/20/2017 10:11 PM, Stuart Hayes wrote: When a command is added to the host's error handler command queue, there is a chance that the error handler will not be woken up. This can happen when one CPU is running scsi_eh_scmd_add() at the same time as another CPU is running scsi_device_unbusy() for a different command on the same host. Each function changes one value, and then looks at the value of a variable that the other function has just changed, but if they both see stale data, neither will actually wake up the error handle > In scsi_eh_scmd_add, host_failed is incremented, then scsi_eh_wakeup() is called, which sees that host_busy is still 2, so it doesn't actually wake up the handler. Meanwhile, in scsi_device_unbusy(), host_busy is decremented, and then it sees that host_failed is 0, so it doesn't even call scsi_eh_wakeup(). If in scsi_eh_scmd_add() we call scsi_eh_wakeup() it is done under spinlock, so we have implied memory barrier here. All stores which we've done before we had released the lock will be seen by any other thread in same critical section if the thread took spinlock after us. So later scsi_device_unbusy() in it's scsi_eh_wakeup() also sees host_failed==1. Actually the problem is that in scsi_device_unbusy() the check below: if (unlikely(scsi_host_in_recovery(shost) && (shost->host_failed || shost->host_eh_scheduled)) is not under same spinlock, so that host_failed can be actually be 0 at these point and we never get to scsi_eh_wakeup, which my patch "scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy" also fixes by putting these check under proper lock and thus the implied barrier is added and we don't need actual barrier here. Please see "LOCK ACQUISITION FUNCTIONS" Documentation/memory-barriers.txt for further information about implied memory barriers. Signed-off-by: Stuart Hyaes --- diff -pur linux-4.14/drivers/scsi/scsi_error.c linux-4.14-stu/drivers/scsi/scsi_error.c --- linux-4.14/drivers/scsi/scsi_error.c2017-11-12 12:46:13.0 -0600 +++ linux-4.14-stu/drivers/scsi/scsi_error.c2017-11-17 14:22:19.230867923 -0600 @@ -243,6 +243,10 @@ void scsi_eh_scmd_add(struct scsi_cmnd * scsi_eh_reset(scmd); list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q); shost->host_failed++; + /* +* See scsi_device_unbusy() for explanation of smp_mb(). +*/ + smp_mb(); scsi_eh_wakeup(shost); spin_unlock_irqrestore(shost->host_lock, flags); } diff -pur linux-4.14/drivers/scsi/scsi_lib.c linux-4.14-stu/drivers/scsi/scsi_lib.c --- linux-4.14/drivers/scsi/scsi_lib.c 2017-11-12 12:46:13.0 -0600 +++ linux-4.14-stu/drivers/scsi/scsi_lib.c 2017-11-17 14:22:15.814867833 -0600 @@ -325,6 +325,15 @@ void scsi_device_unbusy(struct scsi_devi unsigned long flags; atomic_dec(&shost->host_busy); + + /* This function changes host_busy and looks at host_failed, while +* scsi_eh_scmd_add() updates host_failed and looks at host_busy (in +* scsi_eh_wakeup())... if these happen simultaneously without the smp +* memory barrier, each can see the old value, such that neither will +* wake up the error handler, which can cause the host controller to +* be hung forever. +*/ + smp_mb(); if (starget->can_queue > 0) atomic_dec(&starget->target_busy); --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy
> Are there any issues with this patch (https://patchwork.kernel.org/patch/9938919/) that Pavel Tikhomirov submitted back in September? I am willing to help if there's anything I can do to help get it accepted. Hi, Stuart, I asked James Bottomley about the patch status offlist and it seems that the problem is - patch lacks testing and review. I would highly appreciate review from your side and anyone who wants to participate! And if you can confirm that the patch solves the problem on your environment with no side effects please add "Tested-by:" tag also. Thanks, Pavel On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote: We have a problem on several our nodes with scsi EH. Imagine such an order of execution of two threads: CPU1 scsi_eh_scmd_add CPU2 scsi_host_queue_ready /* shost->host_busy == 1 initialy */ if (shost->shost_state == SHOST_RECOVERY) /* does not get here */ return 0; lock(shost->host_lock); shost->shost_state = SHOST_RECOVERY; busy = shost->host_busy++; /* host->can_queue == 1 initialy, busy == 1 * - go to starved label */ lock(shost->host_lock) /* wait */ shost->host_failed++; /* shost->host_busy == 2, shost->host_failed == 1 */ call scsi_eh_wakeup(shost) { if (host_busy == host_failed) { /* does not get here */ wake_up_process(shost->ehandler) } } unlock(shost->host_lock) /* acquire lock */ shost->host_busy--; Finaly we do not wakeup scsi_error_handler and all other commands coming will hang as we are in never ending recovery state as there is no one left to wakeup handler. So scsi disc in these host becomes unresponsive and all bio on node hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.) Main idea of the fix is to try to do wake up every time we decrement host_busy or increment host_failed(the latter is already OK). Now the very *last* one of busy threads getting host_lock after decrementing host_busy will see all write operations on host's shost_state, host_busy and host_failed completed thanks to implied memory barriers on spin_lock/unlock, so at the time of busy==failed we will trigger wakeup in at least one thread. (Thats why putting recovery and failed checks under lock) Signed-off-by: Pavel Tikhomirov --- drivers/scsi/scsi_lib.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f6097b89d5d3..6c99221d60aa 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev) if (starget->can_queue > 0) atomic_dec(&starget->target_busy); + spin_lock_irqsave(shost->host_lock, flags); if (unlikely(scsi_host_in_recovery(shost) && -(shost->host_failed || shost->host_eh_scheduled))) { - spin_lock_irqsave(shost->host_lock, flags); +(shost->host_failed || shost->host_eh_scheduled))) scsi_eh_wakeup(shost); - spin_unlock_irqrestore(shost->host_lock, flags); - } + spin_unlock_irqrestore(shost->host_lock, flags); atomic_dec(&sdev->device_busy); } @@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct request_queue *q, spin_unlock_irq(shost->host_lock); out_dec: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && +(shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + return 0; } @@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, out_dec_host_busy: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && +(shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + out_dec_target_busy: if (scsi_target(sdev)->can_queue > 0) atomic_dec(&scsi_target(sdev)->target_busy); -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy
ping On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote: We have a problem on several our nodes with scsi EH. Imagine such an order of execution of two threads: CPU1 scsi_eh_scmd_add CPU2 scsi_host_queue_ready /* shost->host_busy == 1 initialy */ if (shost->shost_state == SHOST_RECOVERY) /* does not get here */ return 0; lock(shost->host_lock); shost->shost_state = SHOST_RECOVERY; busy = shost->host_busy++; /* host->can_queue == 1 initialy, busy == 1 * - go to starved label */ lock(shost->host_lock) /* wait */ shost->host_failed++; /* shost->host_busy == 2, shost->host_failed == 1 */ call scsi_eh_wakeup(shost) { if (host_busy == host_failed) { /* does not get here */ wake_up_process(shost->ehandler) } } unlock(shost->host_lock) /* acquire lock */ shost->host_busy--; Finaly we do not wakeup scsi_error_handler and all other commands coming will hang as we are in never ending recovery state as there is no one left to wakeup handler. So scsi disc in these host becomes unresponsive and all bio on node hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.) Main idea of the fix is to try to do wake up every time we decrement host_busy or increment host_failed(the latter is already OK). Now the very *last* one of busy threads getting host_lock after decrementing host_busy will see all write operations on host's shost_state, host_busy and host_failed completed thanks to implied memory barriers on spin_lock/unlock, so at the time of busy==failed we will trigger wakeup in at least one thread. (Thats why putting recovery and failed checks under lock) Signed-off-by: Pavel Tikhomirov --- drivers/scsi/scsi_lib.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f6097b89d5d3..6c99221d60aa 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev) if (starget->can_queue > 0) atomic_dec(&starget->target_busy); + spin_lock_irqsave(shost->host_lock, flags); if (unlikely(scsi_host_in_recovery(shost) && -(shost->host_failed || shost->host_eh_scheduled))) { - spin_lock_irqsave(shost->host_lock, flags); +(shost->host_failed || shost->host_eh_scheduled))) scsi_eh_wakeup(shost); - spin_unlock_irqrestore(shost->host_lock, flags); - } + spin_unlock_irqrestore(shost->host_lock, flags); atomic_dec(&sdev->device_busy); } @@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct request_queue *q, spin_unlock_irq(shost->host_lock); out_dec: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && +(shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + return 0; } @@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, out_dec_host_busy: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && +(shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + out_dec_target_busy: if (scsi_target(sdev)->can_queue > 0) atomic_dec(&scsi_target(sdev)->target_busy); -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy
Hi. Please tell if there is something I can do to help the patch get processed? It is on the list without reply for almost a month. On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote: We have a problem on several our nodes with scsi EH. Imagine such an order of execution of two threads: CPU1 scsi_eh_scmd_add CPU2 scsi_host_queue_ready /* shost->host_busy == 1 initialy */ if (shost->shost_state == SHOST_RECOVERY) /* does not get here */ return 0; lock(shost->host_lock); shost->shost_state = SHOST_RECOVERY; busy = shost->host_busy++; /* host->can_queue == 1 initialy, busy == 1 * - go to starved label */ lock(shost->host_lock) /* wait */ shost->host_failed++; /* shost->host_busy == 2, shost->host_failed == 1 */ call scsi_eh_wakeup(shost) { if (host_busy == host_failed) { /* does not get here */ wake_up_process(shost->ehandler) } } unlock(shost->host_lock) /* acquire lock */ shost->host_busy--; Finaly we do not wakeup scsi_error_handler and all other commands coming will hang as we are in never ending recovery state as there is no one left to wakeup handler. So scsi disc in these host becomes unresponsive and all bio on node hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.) Main idea of the fix is to try to do wake up every time we decrement host_busy or increment host_failed(the latter is already OK). Now the very *last* one of busy threads getting host_lock after decrementing host_busy will see all write operations on host's shost_state, host_busy and host_failed completed thanks to implied memory barriers on spin_lock/unlock, so at the time of busy==failed we will trigger wakeup in at least one thread. (Thats why putting recovery and failed checks under lock) Signed-off-by: Pavel Tikhomirov --- drivers/scsi/scsi_lib.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f6097b89d5d3..6c99221d60aa 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev) if (starget->can_queue > 0) atomic_dec(&starget->target_busy); + spin_lock_irqsave(shost->host_lock, flags); if (unlikely(scsi_host_in_recovery(shost) && -(shost->host_failed || shost->host_eh_scheduled))) { - spin_lock_irqsave(shost->host_lock, flags); +(shost->host_failed || shost->host_eh_scheduled))) scsi_eh_wakeup(shost); - spin_unlock_irqrestore(shost->host_lock, flags); - } + spin_unlock_irqrestore(shost->host_lock, flags); atomic_dec(&sdev->device_busy); } @@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct request_queue *q, spin_unlock_irq(shost->host_lock); out_dec: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && +(shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + return 0; } @@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, out_dec_host_busy: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && +(shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + out_dec_target_busy: if (scsi_target(sdev)->can_queue > 0) atomic_dec(&scsi_target(sdev)->target_busy); -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
[PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy
We have a problem on several our nodes with scsi EH. Imagine such an order of execution of two threads: CPU1 scsi_eh_scmd_add CPU2 scsi_host_queue_ready /* shost->host_busy == 1 initialy */ if (shost->shost_state == SHOST_RECOVERY) /* does not get here */ return 0; lock(shost->host_lock); shost->shost_state = SHOST_RECOVERY; busy = shost->host_busy++; /* host->can_queue == 1 initialy, busy == 1 * - go to starved label */ lock(shost->host_lock) /* wait */ shost->host_failed++; /* shost->host_busy == 2, shost->host_failed == 1 */ call scsi_eh_wakeup(shost) { if (host_busy == host_failed) { /* does not get here */ wake_up_process(shost->ehandler) } } unlock(shost->host_lock) /* acquire lock */ shost->host_busy--; Finaly we do not wakeup scsi_error_handler and all other commands coming will hang as we are in never ending recovery state as there is no one left to wakeup handler. So scsi disc in these host becomes unresponsive and all bio on node hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.) Main idea of the fix is to try to do wake up every time we decrement host_busy or increment host_failed(the latter is already OK). Now the very *last* one of busy threads getting host_lock after decrementing host_busy will see all write operations on host's shost_state, host_busy and host_failed completed thanks to implied memory barriers on spin_lock/unlock, so at the time of busy==failed we will trigger wakeup in at least one thread. (Thats why putting recovery and failed checks under lock) Signed-off-by: Pavel Tikhomirov --- drivers/scsi/scsi_lib.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f6097b89d5d3..6c99221d60aa 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev) if (starget->can_queue > 0) atomic_dec(&starget->target_busy); + spin_lock_irqsave(shost->host_lock, flags); if (unlikely(scsi_host_in_recovery(shost) && -(shost->host_failed || shost->host_eh_scheduled))) { - spin_lock_irqsave(shost->host_lock, flags); +(shost->host_failed || shost->host_eh_scheduled))) scsi_eh_wakeup(shost); - spin_unlock_irqrestore(shost->host_lock, flags); - } + spin_unlock_irqrestore(shost->host_lock, flags); atomic_dec(&sdev->device_busy); } @@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct request_queue *q, spin_unlock_irq(shost->host_lock); out_dec: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && +(shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + return 0; } @@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, out_dec_host_busy: atomic_dec(&shost->host_busy); + + spin_lock_irq(shost->host_lock); + if (unlikely(scsi_host_in_recovery(shost) && +(shost->host_failed || shost->host_eh_scheduled))) + scsi_eh_wakeup(shost); + spin_unlock_irq(shost->host_lock); + out_dec_target_busy: if (scsi_target(sdev)->can_queue > 0) atomic_dec(&scsi_target(sdev)->target_busy); -- 2.13.5
[PATCH v4 2/2] prctl: propagate has_child_subreaper flag to every descendant
If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Moreover, I doubt someone intentionaly pre-forks the children whitch should reparent to init before becoming subreaper, because some our ancestor migh have had is_child_subreaper flag while forking our sub-tree and our childs will all inherit has_child_subreaper flag, and we have no way to influence it. And only way to check if we have no has_child_subreaper flag is to create some childs, kill them and see where they will reparent to. Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing seems to be the same. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. * for a) to be true need to move has_child_subreaper inheritance under the same tasklist_lock with adding task to its ->real_parent->children as without it process can inherit zero has_child_subreaper, then we set 1 to it's parent flag, check that parent has no more children, and only after child with wrong flag is added to the tree. * Also make these inheritance more clear by using real_parent instead of current, as on clone(CLONE_PARENT) if current has is_child_subreaper and real_parent has no is_child_subreaper or has_child_subreaper, child will have has_child_subreaper flag set without actually having a subreaper in it's ancestors. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. v2: switch to walk_process_tree() general helper, move has_child_subreaper inheritance v3: remove csr_descendant leftover, change current to real_parent in has_child_subreaper inheritance v4: small commit message fix Signed-off-by: Pavel Tikhomirov --- kernel/fork.c | 10 +++--- kernel/sys.c | 22 ++ 2 files changed, 29 insertions(+), 3 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 135b7a4..c814e59 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1367,9 +1367,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; - sig->has_child_subreaper = current->signal->has_child_subreaper || - current->signal->is_child_subreaper; - mutex_init(&sig->cred_guard_mutex); return 0; @@ -1800,6 +1797,13 @@ static __latent_entropy struct task_struct *copy_process( p->signal->leader_pid = pid; p->signal->tty = tty_kref_get(current->signal->tty); + /* +* Inherit has_child_subreaper flag under the same +* tasklist_lock with adding child to the process tree +* for propagate_has_child_subreaper optimization. +*/ + p->signal->has_child_subreaper = p->real_parent->signal->has_child_subreaper || + p->real_parent->signal->is_child_subreaper; list_add_tail(&p->sibling, &p->real_parent->children); list_add_tail_rcu(&p->tasks, &init_task.tasks); attach_pid(p, PIDTYPE_PGID); diff --git a/kernel/sys.c b/kernel/sys.c index 842914e..0e4d566 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2063,6 +2063,24 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) } #endif +static int propagate_has_child_subreaper(struct task_struct *p, void *data) +{ + /* +* If task has has_child_subreaper - all its decendants +* already have these flag too and new decendants will +
[PATCH 1/2] introduce the walk_process_tree() helper
From: Oleg Nesterov Add the new helper to walk the process tree, the next patch adds a user. Note that it visits the group leaders only, proc_visitor can do for_each_thread itself or we can trivially extend walk_process_tree() to do this. Signed-off-by: Oleg Nesterov Signed-off-by: Pavel Tikhomirov --- include/linux/sched.h | 3 +++ kernel/fork.c | 32 2 files changed, 35 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index ad3ec9e..7f8ab91 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -3067,6 +3067,9 @@ extern bool current_is_single_threaded(void); #define for_each_process_thread(p, t) \ for_each_process(p) for_each_thread(p, t) +typedef int (*proc_visitor)(struct task_struct *p, void *data); +void walk_process_tree(struct task_struct *top, proc_visitor, void *); + static inline int get_nr_threads(struct task_struct *tsk) { return tsk->signal->nr_threads; diff --git a/kernel/fork.c b/kernel/fork.c index 11c5c8a..135b7a4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2053,6 +2053,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp, } #endif +void walk_process_tree(struct task_struct *top, proc_visitor visitor, void *data) +{ + struct task_struct *leader, *parent, *child; + int res; + + read_lock(&tasklist_lock); + leader = top = top->group_leader; +down: + for_each_thread(leader, parent) { + list_for_each_entry(child, &parent->children, sibling) { + res = visitor(child, data); + if (res) { + if (res < 0) + goto out; + leader = child; + goto down; + } +up: + ; + } + } + + if (leader != top) { + child = leader; + parent = child->real_parent; + leader = parent->group_leader; + goto up; + } +out: + read_unlock(&tasklist_lock); +} + #ifndef ARCH_MIN_MMSTRUCT_ALIGN #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif -- 2.9.3
[PATCH v3 2/2] prctl: propagate has_child_subreaper flag to every descendant
If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Moreover, I doubt someone intentionaly pre-forks the children whitch should reparent to previous reaper(e.g. init) before becoming subreaper, because some our ancestor migh have had is_child_subreaper flag while forking our sub-tree and our childs will all inherit has_child_subreaper flag, and we have no way to influence it. And only way to check if we have no has_child_subreaper flag is to create some childs, kill them and see where they will reparent to. Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing seems to be the same. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. * for a) to be true need to move has_child_subreaper inheritance under the same tasklist_lock with adding task to its ->real_parent->children as without it process can inherit zero has_child_subreaper, then we set 1 to it's parent flag, check that it has no more parents, and only after child with wrong flag is added to the tree. * also make these inheritance more clear by using real_parent instead of current, as on clone(CLONE_PARENT) if current has is_child_subreaper and real_parent has no is_child_subreaper or has_child_subreaper, child will have has_child_subreaper flag set without actually having a subreaper in it's ancestors. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. v2: switch to walk_process_tree() general helper, move has_child_subreaper inheritance v3: remove csr_descendant leftover, change current to real_parent in has_child_subreaper inheritance Signed-off-by: Pavel Tikhomirov --- kernel/fork.c | 10 +++--- kernel/sys.c | 22 ++ 2 files changed, 29 insertions(+), 3 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 135b7a4..c814e59 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1367,9 +1367,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; - sig->has_child_subreaper = current->signal->has_child_subreaper || - current->signal->is_child_subreaper; - mutex_init(&sig->cred_guard_mutex); return 0; @@ -1800,6 +1797,13 @@ static __latent_entropy struct task_struct *copy_process( p->signal->leader_pid = pid; p->signal->tty = tty_kref_get(current->signal->tty); + /* +* Inherit has_child_subreaper flag under the same +* tasklist_lock with adding child to the process tree +* for propagate_has_child_subreaper optimization. +*/ + p->signal->has_child_subreaper = p->real_parent->signal->has_child_subreaper || + p->real_parent->signal->is_child_subreaper; list_add_tail(&p->sibling, &p->real_parent->children); list_add_tail_rcu(&p->tasks, &init_task.tasks); attach_pid(p, PIDTYPE_PGID); diff --git a/kernel/sys.c b/kernel/sys.c index 842914e..0e4d566 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2063,6 +2063,24 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) } #endif +static int propagate_has_child_subreaper(struct task_struct *p, void *data) +{ + /* +* If task has has_child_subreaper - all its decendants +* already have these flag too and new decendants will +* inher
[PATCH 1/2] introduce the walk_process_tree() helper
From: Oleg Nesterov Add the new helper to walk the process tree, the next patch adds a user. Note that it visits the group leaders only, proc_visitor can do for_each_thread itself or we can trivially extend walk_process_tree() to do this. Signed-off-by: Oleg Nesterov Signed-off-by: Pavel Tikhomirov --- include/linux/sched.h | 3 +++ kernel/fork.c | 32 2 files changed, 35 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index ad3ec9e..7f8ab91 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -3067,6 +3067,9 @@ extern bool current_is_single_threaded(void); #define for_each_process_thread(p, t) \ for_each_process(p) for_each_thread(p, t) +typedef int (*proc_visitor)(struct task_struct *p, void *data); +void walk_process_tree(struct task_struct *top, proc_visitor, void *); + static inline int get_nr_threads(struct task_struct *tsk) { return tsk->signal->nr_threads; diff --git a/kernel/fork.c b/kernel/fork.c index 11c5c8a..135b7a4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2053,6 +2053,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp, } #endif +void walk_process_tree(struct task_struct *top, proc_visitor visitor, void *data) +{ + struct task_struct *leader, *parent, *child; + int res; + + read_lock(&tasklist_lock); + leader = top = top->group_leader; +down: + for_each_thread(leader, parent) { + list_for_each_entry(child, &parent->children, sibling) { + res = visitor(child, data); + if (res) { + if (res < 0) + goto out; + leader = child; + goto down; + } +up: + ; + } + } + + if (leader != top) { + child = leader; + parent = child->real_parent; + leader = parent->group_leader; + goto up; + } +out: + read_unlock(&tasklist_lock); +} + #ifndef ARCH_MIN_MMSTRUCT_ALIGN #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif -- 2.9.3
[PATCH v4 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic
Oleg Nesterov (1): introduce the walk_process_tree() helper Pavel Tikhomirov (1): prctl: propagate has_child_subreaper flag to every descendant include/linux/sched.h | 3 +++ kernel/fork.c | 42 +++--- kernel/sys.c | 22 ++ 3 files changed, 64 insertions(+), 3 deletions(-) -- 2.9.3
[PATCH v3 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic
Oleg Nesterov (1): introduce the walk_process_tree() helper Pavel Tikhomirov (1): prctl: propagate has_child_subreaper flag to every descendant include/linux/sched.h | 3 +++ kernel/fork.c | 42 +++--- kernel/sys.c | 22 ++ 3 files changed, 64 insertions(+), 3 deletions(-) -- 2.9.3
Re: [PATCH v3 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic
please drop it, errors in commit message On 01/30/2017 05:48 PM, Pavel Tikhomirov wrote: Oleg Nesterov (1): introduce the walk_process_tree() helper Pavel Tikhomirov (1): prctl: propagate has_child_subreaper flag to every descendant include/linux/sched.h | 3 +++ kernel/fork.c | 42 +++--- kernel/sys.c | 22 ++ 3 files changed, 64 insertions(+), 3 deletions(-) -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH v2 2/2] prctl: propagate has_child_subreaper flag to every descendant
On 01/30/2017 03:51 PM, Oleg Nesterov wrote: On 01/27, Pavel Tikhomirov wrote: --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1725,6 +1725,8 @@ struct task_struct { struct signal_struct *signal; struct sighand_struct *sighand; + struct list_head csr_descendant; + You forgot to remove this part ;) Oh, sure. --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1367,9 +1367,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; - sig->has_child_subreaper = current->signal->has_child_subreaper || - current->signal->is_child_subreaper; - mutex_init(&sig->cred_guard_mutex); return 0; @@ -1800,6 +1797,13 @@ static __latent_entropy struct task_struct *copy_process( p->signal->leader_pid = pid; p->signal->tty = tty_kref_get(current->signal->tty); + /* +* Inherit has_child_subreaper flag under the same +* tasklist_lock with adding child to the process tree +* for propagate_has_child_subreaper optimization. +*/ + p->signal->has_child_subreaper = current->signal->has_child_subreaper || + current->signal->is_child_subreaper; Ah yes, we need this change too... And perhaps it would be more correct to do p->signal->has_child_subreaper = p->real_parent->signal->has_child_subreaper || p->real_parent->signal->is_child_subreaper; the current code is not exactly right if CLONE_PARENT... I'm fine with inheriting 'has' flag from real_parent, because if real_parent does not have 'has' flag set but parent has 'has' set, we inherited the flag in vain. But I don't actually think that inheritance from parent not real_parent breaks my optimization: if real_parent has the flag, so does the parent. Oleg. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] prctl.2: Document new PR_SET_CHILD_SUBREAPER semantics
On 01/28/2017 01:47 AM, Michael Kerrisk (man-pages) wrote: Hello Pavel, On 27 January 2017 at 23:11, Pavel Tikhomirov wrote: old semantics was non deterministic and worked differently depending on the external factors, but nothing changes if process first sets itself subreaper and only after forks When did the kernel behavior change? Sorry for inconvenience, should have added you to cc of a main patch also. Kernel behavior haven't changed yet, doc change will be a continuation of "[PATCH v2 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic", see https://lkml.org/lkml/2017/1/27/108 Cheers, Michael Signed-off-by: Pavel Tikhomirov --- man2/prctl.2 | 24 +--- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/man2/prctl.2 b/man2/prctl.2 index 97cf21a..84fbd7e 100644 --- a/man2/prctl.2 +++ b/man2/prctl.2 @@ -162,20 +162,30 @@ if is zero, unset the attribute. When a process is marked as a child subreaper, -all of the children that it creates, and their descendants, +all of the children that it creates or have created already, and their descendants, will be marked as having a subreaper. In effect, a subreaper fulfills the role of .BR init (1) for its descendant processes. -Upon termination of a process -that is orphaned (i.e., its immediate parent has already terminated) -and marked as having a subreaper, -the nearest still living ancestor subreaper -will receive a +Upon termination of a process having a subreaper, +all its children become orphaned +and will be reparented to the nearest still living ancestor subreaper. +So that on it's adopted child termination +these subreaper will receive a .BR SIGCHLD signal and will be able to .BR wait (2) -on the process to discover its termination status. +on the child to discover its termination status. + +Note, that on older kernels these prctl works slightly different. +Child subreaper process was not actualy the +.BR init (1) +for all its descendants. +If process forks a child while not been a child subreaper, +and after sets himself child subreaper, +sub-tree of the child might or might not reparent to the subreaper, +depending on the configuration of ancestors of the subreaper, +at the time of forking our subtree. .TP .BR PR_GET_CHILD_SUBREAPER " (since Linux 3.4)" Return the "child subreaper" setting of the caller, -- 2.9.3 -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
[PATCH] prctl.2: Document new PR_SET_CHILD_SUBREAPER semantics
old semantics was non deterministic and worked differently depending on the external factors, but nothing changes if process first sets itself subreaper and only after forks Signed-off-by: Pavel Tikhomirov --- man2/prctl.2 | 24 +--- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/man2/prctl.2 b/man2/prctl.2 index 97cf21a..84fbd7e 100644 --- a/man2/prctl.2 +++ b/man2/prctl.2 @@ -162,20 +162,30 @@ if is zero, unset the attribute. When a process is marked as a child subreaper, -all of the children that it creates, and their descendants, +all of the children that it creates or have created already, and their descendants, will be marked as having a subreaper. In effect, a subreaper fulfills the role of .BR init (1) for its descendant processes. -Upon termination of a process -that is orphaned (i.e., its immediate parent has already terminated) -and marked as having a subreaper, -the nearest still living ancestor subreaper -will receive a +Upon termination of a process having a subreaper, +all its children become orphaned +and will be reparented to the nearest still living ancestor subreaper. +So that on it's adopted child termination +these subreaper will receive a .BR SIGCHLD signal and will be able to .BR wait (2) -on the process to discover its termination status. +on the child to discover its termination status. + +Note, that on older kernels these prctl works slightly different. +Child subreaper process was not actualy the +.BR init (1) +for all its descendants. +If process forks a child while not been a child subreaper, +and after sets himself child subreaper, +sub-tree of the child might or might not reparent to the subreaper, +depending on the configuration of ancestors of the subreaper, +at the time of forking our subtree. .TP .BR PR_GET_CHILD_SUBREAPER " (since Linux 3.4)" Return the "child subreaper" setting of the caller, -- 2.9.3
[PATCH v2 2/2] prctl: propagate has_child_subreaper flag to every descendant
If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Moreover, I doubt someone intentionaly pre-forks the children whitch should reparent to init before becoming subreaper, because some our ancestor migh have had is_child_subreaper flag while forking our sub-tree and our childs will all inherit has_child_subreaper flag, and we have no way to influence it. And only way to check if we have no has_child_subreaper flag is to create some childs, kill them and see where they will reparent to. Use walk_process_tree helper to walk subtree, thanks to Oleg! Timing seems to be the same. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. * for a) to be true need to move has_child_subreaper inheritance under the same tasklist_lock with adding task to its ->real_parent->children as without it process can inherit zero has_child_subreaper, then we set 1 to it's parent flag, check that parent has no more children, and only after child with wrong flag is added to the tree. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. v2: switch to walk_process_tree() general helper, move has_child_subreaper inheritance Signed-off-by: Pavel Tikhomirov --- include/linux/sched.h | 2 ++ kernel/fork.c | 10 +++--- kernel/sys.c | 22 ++ 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 7f8ab91..a08f006 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1725,6 +1725,8 @@ struct task_struct { struct signal_struct *signal; struct sighand_struct *sighand; + struct list_head csr_descendant; + sigset_t blocked, real_blocked; sigset_t saved_sigmask; /* restored if set_restore_sigmask() was used */ struct sigpending pending; diff --git a/kernel/fork.c b/kernel/fork.c index 135b7a4..5874d01 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1367,9 +1367,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; - sig->has_child_subreaper = current->signal->has_child_subreaper || - current->signal->is_child_subreaper; - mutex_init(&sig->cred_guard_mutex); return 0; @@ -1800,6 +1797,13 @@ static __latent_entropy struct task_struct *copy_process( p->signal->leader_pid = pid; p->signal->tty = tty_kref_get(current->signal->tty); + /* +* Inherit has_child_subreaper flag under the same +* tasklist_lock with adding child to the process tree +* for propagate_has_child_subreaper optimization. +*/ + p->signal->has_child_subreaper = current->signal->has_child_subreaper || + current->signal->is_child_subreaper; list_add_tail(&p->sibling, &p->real_parent->children); list_add_tail_rcu(&p->tasks, &init_task.tasks); attach_pid(p, PIDTYPE_PGID); diff --git a/kernel/sys.c b/kernel/sys.c index 842914e..0e4d566 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2063,6 +2063,24 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) } #endif +static int propagate_has_child_subreaper(struct task_struct *p, void *data) +{ + /* +* If task has has_child_subreaper - all its decendants +* already
[PATCH 1/2] introduce the walk_process_tree() helper
From: Oleg Nesterov Add the new helper to walk the process tree, the next patch adds a user. Note that it visits the group leaders only, proc_visitor can do for_each_thread itself or we can trivially extend walk_process_tree() to do this. Signed-off-by: Oleg Nesterov Reviewed-by: Pavel Tikhomirov --- include/linux/sched.h | 3 +++ kernel/fork.c | 32 2 files changed, 35 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index ad3ec9e..7f8ab91 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -3067,6 +3067,9 @@ extern bool current_is_single_threaded(void); #define for_each_process_thread(p, t) \ for_each_process(p) for_each_thread(p, t) +typedef int (*proc_visitor)(struct task_struct *p, void *data); +void walk_process_tree(struct task_struct *top, proc_visitor, void *); + static inline int get_nr_threads(struct task_struct *tsk) { return tsk->signal->nr_threads; diff --git a/kernel/fork.c b/kernel/fork.c index 11c5c8a..135b7a4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2053,6 +2053,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp, } #endif +void walk_process_tree(struct task_struct *top, proc_visitor visitor, void *data) +{ + struct task_struct *leader, *parent, *child; + int res; + + read_lock(&tasklist_lock); + leader = top = top->group_leader; +down: + for_each_thread(leader, parent) { + list_for_each_entry(child, &parent->children, sibling) { + res = visitor(child, data); + if (res) { + if (res < 0) + goto out; + leader = child; + goto down; + } +up: + ; + } + } + + if (leader != top) { + child = leader; + parent = child->real_parent; + leader = parent->group_leader; + goto up; + } +out: + read_unlock(&tasklist_lock); +} + #ifndef ARCH_MIN_MMSTRUCT_ALIGN #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif -- 2.9.3
[PATCH v2 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic
will send documentation change proposal in reply to these letter Oleg Nesterov (1): introduce the walk_process_tree() helper Pavel Tikhomirov (1): prctl: propagate has_child_subreaper flag to every descendant include/linux/sched.h | 5 + kernel/fork.c | 42 +++--- kernel/sys.c | 22 ++ 3 files changed, 66 insertions(+), 3 deletions(-) -- 2.9.3
Re: [PATCH] introduce the walk_process_tree() helper
Will include it in patch-set with documentation fix. Thanks Oleg! Reviewed-by: Pavel Tikhomirov On 01/23/2017 02:57 PM, Oleg Nesterov wrote: Add the new helper to walk the process tree, the next patch adds a user. Note that it visits the group leaders only, proc_visitor can do for_each_thread itself or we can trivially extend walk_process_tree() to do this. Signed-off-by: Oleg Nesterov --- include/linux/sched.h | 3 +++ kernel/fork.c | 32 2 files changed, 35 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index e9c009d..681f748 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -3015,6 +3015,9 @@ extern bool current_is_single_threaded(void); #define for_each_process_thread(p, t) \ for_each_process(p) for_each_thread(p, t) +typedef int (*proc_visitor)(struct task_struct *p, void *data); +void walk_process_tree(struct task_struct *top, proc_visitor, void *); + static inline int get_nr_threads(struct task_struct *tsk) { return tsk->signal->nr_threads; diff --git a/kernel/fork.c b/kernel/fork.c index 663c6a7..2522bae 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2042,6 +2042,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp, } #endif +void walk_process_tree(struct task_struct *top, proc_visitor visitor, void *data) +{ + struct task_struct *leader, *parent, *child; + int res; + + read_lock(&tasklist_lock); + leader = top = top->group_leader; +down: + for_each_thread(leader, parent) { + list_for_each_entry(child, &parent->children, sibling) { + res = visitor(child, data); + if (res) { + if (res < 0) + goto out; + leader = child; + goto down; + } +up: + ; + } + } + + if (leader != top) { + child = leader; + parent = child->real_parent; + leader = parent->group_leader; + goto up; + } +out: + read_unlock(&tasklist_lock); +} + #ifndef ARCH_MIN_MMSTRUCT_ALIGN #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
Add to cc Lennart Poettering On 01/23/2017 02:55 PM, Oleg Nesterov wrote: On 01/22, Pavel Tikhomirov wrote: Hmm. could you explain how this change helps CRIU? I mean, why restorer can't do prctl(CHILD_SUBREAPER) before the first fork? Imagine we have these tree in pidns: 1: has_child_subreaper == 0 && is_child_subreaper == 0 |-2: has_child_subreaper == 0 && is_child_subreaper == 1 | |-3: has_child_subreaper == 0 && is_child_subreaper == 0 | | |-5: has_child_subreaper == 0 && is_child_subreaper == 0 | |-4: has_child_subreaper == 1 && is_child_subreaper == 0 | | |-6: has_child_subreaper == 1 && is_child_subreaper == 0 before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1. after c/r: (where restorer had is_child_subreaper == 1, everybody in the tree will have has_child_subreaper == 1) Everybody will reparent to 2. This is clear, but this can only happen if 2 forks 3 and after that sets is_child_subreaper, right? And if someone actually does this then your patch can break this application, no? IOW. Currently CRIU can't restore the process tree with the same has_child_subreaper bits if some process forks before prctl(PR_SET_CHILD_SUBREAPER). It restores the tree as if prctl() was called before the 1st fork. So you change the semantics of PR_SET_CHILD_SUBREAPER and now CRIU is fine simply because you remove this feature: the sub-reaper can no longer pre-fork the children which should reparent to the previous reaper. I won't really argure, but I am not sure this is good idea... If one task uses these feature now it must be very carefull: if some our ancestor have enabled is_child_subreaper somewhere up the tree, forked our tree and after that disabled is_child_subreaper, so we already have has-flag and all children will inherit has-flag irrelevant to what is our order of fork/prctl-ing to become subreaper. And only one way to check if the task has no has_child_subreaper flag is to create some childs kill them and see to where they will reparent, but I doubt that someone is doing these now. At least I think this should be clearly documented. Yes, I surely need to add some documentation here. Thanks for mentioning that! - Will do. You don't need this new member and descendants_lock. task_struct has the ->real_parent pointer so you can work the tree without recursion. Sorry I don't get how I can walk down the tree of all descendants with help of ->real_parent pointer, can you please point on some example or explain a bit more? (I see task_is_descendant() in security/yama/yama_lsm.c but we will need to check it for every process, not only descendants, the latter can be a lot faster.) I'll send a patch, probably a generic helper makes sense. Btw task_is_descendant() looks wrong at first glance. Oleg. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
Sorry I had some problem with mail-agent, resend to be on the safe side. On 01/20/2017 09:14 PM, Oleg Nesterov wrote: On 01/19, Pavel Tikhomirov wrote: Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. Hmm. could you explain how this change helps CRIU? I mean, why restorer can't do prctl(CHILD_SUBREAPER) before the first fork? Imagine we have these tree in pidns: 1: has_child_subreaper == 0 && is_child_subreaper == 0 |-2: has_child_subreaper == 0 && is_child_subreaper == 1 | |-3: has_child_subreaper == 0 && is_child_subreaper == 0 | | |-5: has_child_subreaper == 0 && is_child_subreaper == 0 | |-4: has_child_subreaper == 1 && is_child_subreaper == 0 | | |-6: has_child_subreaper == 1 && is_child_subreaper == 0 before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1. after c/r: (where restorer had is_child_subreaper == 1, everybody in the tree will have has_child_subreaper == 1) Everybody will reparent to 2. Anyway, afaics the patch is sub-optimal and not correct... --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1715,6 +1715,8 @@ struct task_struct { struct signal_struct *signal; struct sighand_struct *sighand; + struct list_head csr_descendant; + You don't need this new member and descendants_lock. task_struct has the ->real_parent pointer so you can work the tree without recursion. Sorry I don't get how I can walk down the tree of all descendants with help of ->real_parent pointer, can you please point on some example or explain a bit more? (I see task_is_descendant() in security/yama/yama_lsm.c but we will need to check it for every process, not only descendants, the latter can be a lot faster.) +static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2) +{ + LIST_HEAD(descendants); + + reaper->signal->is_child_subreaper = arg2; + if (!arg2) + return; + + spin_lock(&descendants_lock); + read_lock(&tasklist_lock); + + list_add(&reaper->csr_descendant, &descendants); + + while (!list_empty(&descendants)) { + struct task_struct *tsk; + struct task_struct *p; + + tsk = list_first_entry(&descendants, struct task_struct, + csr_descendant); + + list_for_each_entry(p, &tsk->children, sibling) { This is not enough. Every thread has its own ->children list, you need to walk the sub-threads as well. Will do. +* If we've found child_reaper - skip descendants in +* it's subtree as they will never get out pidns +*/ + if (is_child_reaper(task_pid(p))) + continue; Again, a child reaper can be multi-threaded, this check can be false negative. Probably is_child_reaper() should be renamed somehow and a new helper makes sense... something like Will do. bool task_is_child_reaper(struct task_struct *p) { return same_thread_group(p, task_active_pid_ns(p)->child_reaper); } Oleg. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant
On 01/20/2017 09:14 PM, Oleg Nesterov wrote: On 01/19, Pavel Tikhomirov wrote: Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. Hmm. could you explain how this change helps CRIU? I mean, why restorer can't do prctl(CHILD_SUBREAPER) before the first fork? Imagine we have these tree in pidns: 1: has_child_subreaper == 0 && is_child_subreaper == 0 |-2: has_child_subreaper == 0 && is_child_subreaper == 1 | |-3: has_child_subreaper == 0 && is_child_subreaper == 0 | | |-5: has_child_subreaper == 0 && is_child_subreaper == 0 | |-4: has_child_subreaper == 1 && is_child_subreaper == 0 | | |-6: has_child_subreaper == 1 && is_child_subreaper == 0 before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1. after c/r: (where restorer had is_child_subreaper == 1, everybody in the tree will have has_child_subreaper == 1) Everybody will reparent to 2. Anyway, afaics the patch is sub-optimal and not correct... --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1715,6 +1715,8 @@ struct task_struct { struct signal_struct *signal; struct sighand_struct *sighand; + struct list_head csr_descendant; + You don't need this new member and descendants_lock. task_struct has the ->real_parent pointer so you can work the tree without recursion. Sorry I don't get how I can walk down the tree of all descendants with help of ->real_parent pointer, can you please point on some example or explain a bit more? (I see task_is_descendant() in security/yama/yama_lsm.c but we will need to check it for every process, not only descendants, the latter can be a lot faster.) +static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2) +{ + LIST_HEAD(descendants); + + reaper->signal->is_child_subreaper = arg2; + if (!arg2) + return; + + spin_lock(&descendants_lock); + read_lock(&tasklist_lock); + + list_add(&reaper->csr_descendant, &descendants); + + while (!list_empty(&descendants)) { + struct task_struct *tsk; + struct task_struct *p; + + tsk = list_first_entry(&descendants, struct task_struct, + csr_descendant); + + list_for_each_entry(p, &tsk->children, sibling) { This is not enough. Every thread has its own ->children list, you need to walk the sub-threads as well. Will do. +* If we've found child_reaper - skip descendants in +* it's subtree as they will never get out pidns +*/ + if (is_child_reaper(task_pid(p))) + continue; Again, a child reaper can be multi-threaded, this check can be false negative. Probably is_child_reaper() should be renamed somehow and a new helper makes sense... something like Will do. bool task_is_child_reaper(struct task_struct *p) { return same_thread_group(p, task_active_pid_ns(p)->child_reaper); } Oleg. -- Best regards, Tikhomirov Pavel Software Developer, Virtuozzo.
[PATCH] prctl: propagate has_child_subreaper flag to every descendant
If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Using csr_descendant list to collect descendants and do tree walk without recursion. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. Signed-off-by: Pavel Tikhomirov --- include/linux/sched.h | 2 ++ kernel/sys.c | 51 ++- 2 files changed, 52 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 4d19052..9cb44c4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1715,6 +1715,8 @@ struct task_struct { struct signal_struct *signal; struct sighand_struct *sighand; + struct list_head csr_descendant; + sigset_t blocked, real_blocked; sigset_t saved_sigmask; /* restored if set_restore_sigmask() was used */ struct sigpending pending; diff --git a/kernel/sys.c b/kernel/sys.c index 842914e..05b6d7d 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2063,6 +2063,55 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) } #endif +static DEFINE_SPINLOCK(descendants_lock); + +static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2) +{ + LIST_HEAD(descendants); + + reaper->signal->is_child_subreaper = arg2; + if (!arg2) + return; + + spin_lock(&descendants_lock); + read_lock(&tasklist_lock); + + list_add(&reaper->csr_descendant, &descendants); + + while (!list_empty(&descendants)) { + struct task_struct *tsk; + struct task_struct *p; + + tsk = list_first_entry(&descendants, struct task_struct, + csr_descendant); + + list_for_each_entry(p, &tsk->children, sibling) { + /* +* If tsk has has_child_subreaper - all its decendants +* already have these flag too and new decendants will +* inherit it on fork, so nothing to be done here. +*/ + if (p->signal->has_child_subreaper) + continue; + + /* +* If we've found child_reaper - skip descendants in +* it's subtree as they will never get out pidns +*/ + if (is_child_reaper(task_pid(p))) + continue; + + p->signal->has_child_subreaper = 1; + list_add(&p->csr_descendant, &descendants); + } + + list_del_init(&tsk->csr_descendant); + } + + read_unlock(&tasklist_lock); + spin_unlock(&descendants_lock); +} + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -2213,7 +2262,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = prctl_get_tid_address(me, (int __user **)arg2); break; case PR_SET_CHILD_SUBREAPER: - me->signal->is_child_subreaper = !!arg2; + prctl_set_child_subreaper(me, !!arg2); break; case PR_GET_CHILD_SUBREAPER: error = put_user(me->signal->is_child_subreaper, -- 2.9.3
[PATCH v2] ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int
> cat /proc/sys/net/ipv4/tcp_notsent_lowat -1 > echo 4294967295 > /proc/sys/net/ipv4/tcp_notsent_lowat -bash: echo: write error: Invalid argument > echo -2147483648 > /proc/sys/net/ipv4/tcp_notsent_lowat > cat /proc/sys/net/ipv4/tcp_notsent_lowat -2147483648 but in documentation we have "tcp_notsent_lowat - UNSIGNED INTEGER" v2: simplify to just proc_douintvec Signed-off-by: Pavel Tikhomirov --- net/ipv4/sysctl_net_ipv4.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 80bc36b..566cfc5 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -958,7 +958,7 @@ static struct ctl_table ipv4_net_table[] = { .data = &init_net.ipv4.sysctl_tcp_notsent_lowat, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_douintvec, }, #ifdef CONFIG_IP_ROUTE_MULTIPATH { -- 2.9.3
[PATCH] ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int
> cat /proc/sys/net/ipv4/tcp_notsent_lowat -1 > echo 4294967295 > /proc/sys/net/ipv4/tcp_notsent_lowat -bash: echo: write error: Invalid argument > echo -2147483648 > /proc/sys/net/ipv4/tcp_notsent_lowat > cat /proc/sys/net/ipv4/tcp_notsent_lowat -2147483648 but in documentation we have "tcp_notsent_lowat - UNSIGNED INTEGER" Signed-off-by: Pavel Tikhomirov --- net/ipv4/sysctl_net_ipv4.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 80bc36b..5361373 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static unsigned int uint_max = UINT_MAX; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -958,7 +959,9 @@ static struct ctl_table ipv4_net_table[] = { .data = &init_net.ipv4.sysctl_tcp_notsent_lowat, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_doulongvec_minmax, + .extra1 = &zero, + .extra2 = &uint_max, }, #ifdef CONFIG_IP_ROUTE_MULTIPATH { -- 2.9.3
[PATCH v2] netfilter: nf_log: fix error on write NONE to logger choice sysctl
It is hard to unbind nf-logger: echo NONE > /proc/sys/net/netfilter/nf_log/0 bash: echo: write error: No such file or directory sysctl -w net.netfilter.nf_log.0=NONE sysctl: setting key "net.netfilter.nf_log.0": No such file or directory net.netfilter.nf_log.0 = NONE You need explicitly send '\0', for instance like: echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0 That seem to be strange, so fix it using proc_dostring. Now it works fine: modprobe nfnetlink_log echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0 cat /proc/sys/net/netfilter/nf_log/0 nfnetlink_log echo NONE > /proc/sys/net/netfilter/nf_log/0 cat /proc/sys/net/netfilter/nf_log/0 NONE v2: add missed error check for proc_dostring Signed-off-by: Pavel Tikhomirov --- net/netfilter/nf_log.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c index a5d41df..58ea5b0 100644 --- a/net/netfilter/nf_log.c +++ b/net/netfilter/nf_log.c @@ -398,16 +398,17 @@ static int nf_log_proc_dostring(struct ctl_table *table, int write, { const struct nf_logger *logger; char buf[NFLOGGER_NAME_LEN]; - size_t size = *lenp; int r = 0; int tindex = (unsigned long)table->extra1; struct net *net = current->nsproxy->net_ns; if (write) { - if (size > sizeof(buf)) - size = sizeof(buf); - if (copy_from_user(buf, buffer, size)) - return -EFAULT; + struct ctl_table tmp = *table; + + tmp.data = buf; + r = proc_dostring(&tmp, write, buffer, lenp, ppos); + if (r) + return r; if (!strcmp(buf, "NONE")) { nf_log_unbind_pf(net, tindex); -- 2.5.5
[PATCH] netfilter: nf_log: fix error on write NONE to logger choice sysctl
It is hard to unbind nf-logger: echo NONE > /proc/sys/net/netfilter/nf_log/0 bash: echo: write error: No such file or directory sysctl -w net.netfilter.nf_log.0=NONE sysctl: setting key "net.netfilter.nf_log.0": No such file or directory net.netfilter.nf_log.0 = NONE You need explicitly send '\0', for instance like: echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0 That seem to be strange, so fix it using proc_dostring. Now it works fine: modprobe nfnetlink_log echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0 cat /proc/sys/net/netfilter/nf_log/0 nfnetlink_log echo NONE > /proc/sys/net/netfilter/nf_log/0 cat /proc/sys/net/netfilter/nf_log/0 NONE Signed-off-by: Pavel Tikhomirov --- net/netfilter/nf_log.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c index a5d41df..c7c5882 100644 --- a/net/netfilter/nf_log.c +++ b/net/netfilter/nf_log.c @@ -398,16 +398,15 @@ static int nf_log_proc_dostring(struct ctl_table *table, int write, { const struct nf_logger *logger; char buf[NFLOGGER_NAME_LEN]; - size_t size = *lenp; int r = 0; int tindex = (unsigned long)table->extra1; struct net *net = current->nsproxy->net_ns; if (write) { - if (size > sizeof(buf)) - size = sizeof(buf); - if (copy_from_user(buf, buffer, size)) - return -EFAULT; + struct ctl_table tmp = *table; + + tmp.data = buf; + proc_dostring(&tmp, write, buffer, lenp, ppos); if (!strcmp(buf, "NONE")) { nf_log_unbind_pf(net, tindex); -- 2.5.5
Re: BUG: KASAN: slab-out-of-bounds in ses_enclosure_data_process+0x900/0xe50
I have very similar problem with SAS2X28, please take a look on a bug report here https://bugzilla.kernel.org/show_bug.cgi?id=108771 Thanks, Pavel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:timers/urgent] posix-timers: Show clock ID in proc file
Commit-ID: 15ef0298deb3929eb6ad6d2334fd2059fd53807c Gitweb: http://git.kernel.org/tip/15ef0298deb3929eb6ad6d2334fd2059fd53807c Author: Pavel Tikhomirov AuthorDate: Fri, 17 May 2013 02:12:03 +0400 Committer: Thomas Gleixner CommitDate: Tue, 28 May 2013 11:41:14 +0200 posix-timers: Show clock ID in proc file Expand information about posix-timers in /proc//timers by adding info about clock, with which the timer was created. I.e. in the forth line of timer info after "notify:" line go "ClockID: ". Signed-off-by: Pavel Tikhomirov Cc: Michael Kerrisk Cc: Matthew Helsley Cc: Pavel Emelyanov Link: http://lkml.kernel.org/r/1368742323-46949-2-git-send-email-snor...@gmail.com Signed-off-by: Thomas Gleixner --- fs/proc/base.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index dd51e50..c3834da 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2118,6 +2118,7 @@ static int show_timer(struct seq_file *m, void *v) nstr[notify & ~SIGEV_THREAD_ID], (notify & SIGEV_THREAD_ID) ? "tid" : "pid", pid_nr_ns(timer->it_pid, tp->ns)); + seq_printf(m, "ClockID: %d\n", timer->it_clock); return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/1] posix-timers: Show clock ID in proc file
Expand information about posix-timers in /proc//timers by adding info about clock, with which the timer was created. I.e. in the forth line of timer info after "notify:" line go "ClockID: ". Signed-off-by: Pavel Tikhomirov --- fs/proc/base.c |1 + 1 file changed, 1 insertion(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index 2dad4a9..8a38eef 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2079,6 +2079,7 @@ static int show_timer(struct seq_file *m, void *v) nstr[notify & ~SIGEV_THREAD_ID], (notify & SIGEV_THREAD_ID) ? "tid" : "pid", pid_nr_ns(timer->it_pid, tp->ns)); + seq_printf(m, "ClockID: %d\n", timer->it_clock); return 0; } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/1] posix timers: Expand exitsting info in proc file
Hi. I'm working on the checkpoint-restore project (http://criu.org), on realisation of posix-timers. To compleatly checkpoint and restore these timers we need to know which clock they are using. So I d'like to add this information to existing syscall which shows posix-timers info. Pavel Tikhomirov (1): posix-timers: Show clock ID in proc file fs/proc/base.c |1 + 1 file changed, 1 insertion(+) -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/