Re: [PATCH] net: sched: sch_teql: fix null-pointer dereference

2021-04-08 Thread Pavel Tikhomirov




On 4/8/21 6:26 PM, Eric Dumazet wrote:



On 4/8/21 5:14 PM, Pavel Tikhomirov wrote:

Reproduce:

   modprobe sch_teql
   tc qdisc add dev teql0 root teql0

This leads to (for instance in Centos 7 VM) OOPS:



Null pointer dereference happens on master->slaves dereference in
teql_destroy() as master is null-pointer.

When qdisc_create() calls teql_qdisc_init() it imediately fails after
check "if (m->dev == dev)" because both devices are teql0, and it does
not set qdisc_priv(sch)->m leaving it zero on error path, then
qdisc_create() imediately calls teql_destroy() which does not expect
zero master pointer and we get OOPS.

Signed-off-by: Pavel Tikhomirov 
---


This makes sense, thanks !

Reviewed-by: Eric Dumazet 


Thanks!



I would think bug origin is

Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")

Can you confirm you have this backported to 3.10.0-1062.7.1.el7.x86_64 ?



According to our source copy it looks backported to 1062.7.1, please see:
https://src.openvz.org/projects/OVZ/repos/vzkernel/browse/net/sched/sch_api.c?at=refs%2Ftags%2Frh7-3.10.0-1062.7.1.el7#1167

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH] net: sched: sch_teql: fix null-pointer dereference

2021-04-08 Thread Pavel Tikhomirov
Reproduce:

  modprobe sch_teql
  tc qdisc add dev teql0 root teql0

This leads to (for instance in Centos 7 VM) OOPS:

[  532.366633] BUG: unable to handle kernel NULL pointer dereference at 
00a8
[  532.366733] IP: [] teql_destroy+0x18/0x100 [sch_teql]
[  532.366825] PGD 8001376d5067 PUD 137e37067 PMD 0
[  532.366906] Oops:  [#1] SMP
[  532.366987] Modules linked in: sch_teql ...
[  532.367945] CPU: 1 PID: 3026 Comm: tc Kdump: loaded Tainted: G   
 T 3.10.0-1062.7.1.el7.x86_64 #1
[  532.368041] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014
[  532.368125] task: 8b7d37d31070 ti: 8b7c9fdbc000 task.ti: 
8b7c9fdbc000
[  532.368224] RIP: 0010:[]  [] 
teql_destroy+0x18/0x100 [sch_teql]
[  532.368320] RSP: 0018:8b7c9fdbf8e0  EFLAGS: 00010286
[  532.368394] RAX: c0612490 RBX: 8b7cb1565e00 RCX: 8b7d35ba2000
[  532.368476] RDX: 8b7d35ba2000 RSI:  RDI: 8b7cb1565e00
[  532.368557] RBP: 8b7c9fdbf8f8 R08: 8b7d3fd1f140 R09: 8b7d3b001600
[  532.368638] R10: 8b7d3b001600 R11: 84c7d65b R12: ffd8
[  532.368719] R13: 8000 R14: 8b7d35ba2000 R15: 8b7c9fdbf9a8
[  532.368800] FS:  7f6a4e872740() GS:8b7d3fd0() 
knlGS:
[  532.368885] CS:  0010 DS:  ES:  CR0: 80050033
[  532.368961] CR2: 00a8 CR3: 0001396ee000 CR4: 000206e0
[  532.369046] Call Trace:
[  532.369159]  [] qdisc_create+0x36e/0x450
[  532.369268]  [] ? ns_capable+0x29/0x50
[  532.369366]  [] ? nla_parse+0x32/0x120
[  532.369442]  [] tc_modify_qdisc+0x13c/0x610
[  532.371508]  [] rtnetlink_rcv_msg+0xa7/0x260
[  532.372668]  [] ? sock_has_perm+0x75/0x90
[  532.373790]  [] ? rtnl_newlink+0x890/0x890
[  532.374914]  [] netlink_rcv_skb+0xab/0xc0
[  532.376055]  [] rtnetlink_rcv+0x28/0x30
[  532.377204]  [] netlink_unicast+0x170/0x210
[  532.378333]  [] netlink_sendmsg+0x308/0x420
[  532.379465]  [] sock_sendmsg+0xb6/0xf0
[  532.380710]  [] ? __xfs_filemap_fault+0x8e/0x1d0 [xfs]
[  532.381868]  [] ? xfs_filemap_fault+0x2c/0x30 [xfs]
[  532.383037]  [] ? __do_fault.isra.61+0x8a/0x100
[  532.384144]  [] ___sys_sendmsg+0x3e9/0x400
[  532.385268]  [] ? handle_mm_fault+0x39d/0x9b0
[  532.386387]  [] ? __do_page_fault+0x238/0x500
[  532.387472]  [] __sys_sendmsg+0x51/0x90
[  532.388560]  [] SyS_sendmsg+0x12/0x20
[  532.389636]  [] system_call_fastpath+0x25/0x2a
[  532.390704]  [] ? system_call_after_swapgs+0xae/0x146
[  532.391753] Code: 00 00 00 00 00 00 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 
66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b b7 48 01 00 00 48 89 fb <48> 8b 
8e a8 00 00 00 48 85 c9 74 43 48 89 ca eb 0f 0f 1f 80 00
[  532.394036] RIP  [] teql_destroy+0x18/0x100 [sch_teql]
[  532.395127]  RSP 
[  532.396179] CR2: 00a8

Null pointer dereference happens on master->slaves dereference in
teql_destroy() as master is null-pointer.

When qdisc_create() calls teql_qdisc_init() it imediately fails after
check "if (m->dev == dev)" because both devices are teql0, and it does
not set qdisc_priv(sch)->m leaving it zero on error path, then
qdisc_create() imediately calls teql_destroy() which does not expect
zero master pointer and we get OOPS.

Signed-off-by: Pavel Tikhomirov 
---
 net/sched/sch_teql.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index 2f1f0a378408..6af6b95bdb67 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -134,6 +134,9 @@ teql_destroy(struct Qdisc *sch)
struct teql_sched_data *dat = qdisc_priv(sch);
struct teql_master *master = dat->m;
 
+   if (!master)
+   return;
+
prev = master->slaves;
if (prev) {
do {
-- 
2.30.2



[PATCH] clone3: add option to change owner of newly created namespaces

2021-04-02 Thread Pavel Tikhomirov
Let's add a flag CLONE_OWNER_NS and clone_args.userns_fd field to
specify a user namespace for clone3 which would become an owner of newly
created namespaces. This owner is restricted to be a descendant of
current user namespace. It means that we can do clone while in more
privileged user namespace than the one which would become an owner.

We need this for CRIU.

1) When CRIU is dumping a container it can face nested user namespaces
and pid namespaces, to properly dump/restore them CRIU needs to also
restore all dependencies between namespaces: parent-child and
userns-owner.

Previously when CRIU was recreating the process tree of container during
restore if we needed to restore a pid namespace init we should first
enter a proper user namespace which should be an owner of restored pid
namespace and only than can clone. That means that the cloned process is
initially created in probably unprivileged user namespace and it brings
a lot of restrictions on what it can do (e.g. we should probably enter
it's ipc/uts/net namespaces already, if they are owned by more
privileged user namespace we would not be able to enter them later).

With new userns_fd option we would be able to recreate process tree with
all container pid namespaces with proper user namespace owners for them
while all processes are left in most privileged user namespace, and it
would be easier to restore all resources of those processes. We can
restore user namespace of each process later when needed.

2) Other problem which this option is trying to solve is that clone3()
with set_tid does not work as desired when we try to recreate a process
in the container which has nested user and pid namespaces.

Imagine that in container we have a chain of processes each of them was
created with clone() with (CLONE_NEWPID | CLONE_NEWUSER) from previous
process and in each pid namespace of this chain we've also created some
amount of processes to hold pid numbers. Next we create one more
"target" process with new pid and user namespaces in the end of chain,
it can have random pids in each pid namespace of the chain.

When CRIU would restore this container it would not be able to restore
pids of "target" process on each pid namespace level with clone3()+
set_tid because it would need to call clone from owner user namespace
(or it's parent if we use CLONE_NEWUSER) of the "target"'s pid
namespace, which has no rights to set_tid on each needed level.

With new userns_fd option we would easily do it, we just need to have
current user namespace to be root user namespace of the container and
pass "target"'s pid namespace owner user namespace to userns_fd.

Here are two examples on the use of new userns_fd option:

- clone3_owner_ns.c - is simple demonstration of how process can create
pid namespace owned by it's user namespace descendant;
- clone3_set_tid_vs_owner_ns.c - is a bit more complex demonstration
on how clone3+set_tid can work for restoring pids on each level of
nested user and pid namespaces when used together with userns_fd:

https://github.com/Snorch/clone3_owner_ns

Signed-off-by: Pavel Tikhomirov 
---
 include/linux/nsproxy.h|  3 ++-
 include/linux/sched/task.h |  1 +
 include/linux/user_namespace.h |  6 ++
 include/uapi/linux/sched.h |  3 +++
 kernel/fork.c  | 18 +++---
 kernel/nsproxy.c   | 19 ++-
 kernel/user_namespace.c| 22 ++
 7 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index cdb171efc7cb..201bbb75637d 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -91,7 +91,8 @@ static inline struct cred *nsset_cred(struct nsset *set)
  *
  */
 
-int copy_namespaces(unsigned long flags, struct task_struct *tsk);
+int copy_namespaces(unsigned long flags, struct task_struct *tsk,
+   int userns_fd);
 void exit_task_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
 void free_nsproxy(struct nsproxy *ns);
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index c0f71f2e7160..176b087443a0 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -33,6 +33,7 @@ struct kernel_clone_args {
int cgroup;
struct cgroup *cgrp;
struct css_set *cset;
+   int userns_fd;
 };
 
 /*
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..cecc4c55b7cb 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -136,6 +136,7 @@ extern bool in_userns(const struct user_namespace *ancestor,
   const struct user_namespace *child);
 extern bool current_in_userns(const struct user_namespace *target_ns);
 struct ns_common *ns_get_owner(struct ns_common *ns);
+extern struct user_namespac

Re: [PATCH] move_mount: allow to add a mount into an existing group

2021-03-29 Thread Pavel Tikhomirov




On 3/29/21 12:47 AM, Andrei Vagin wrote:

On Thu, Mar 25, 2021 at 03:14:44PM +0300, Pavel Tikhomirov wrote:

Previously a sharing group (shared and master ids pair) can be only
inherited when mount is created via bindmount. This patch adds an
ability to add an existing private mount into an existing sharing group.

With this functionality one can first create the desired mount tree from
only private mounts (without the need to care about undesired mount
propagation or mount creation order implied by sharing group
dependencies), and next then setup any desired mount sharing between
those mounts in tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- reverse sharing groups vs mount tree order requires complex mounts
   reordering which mostly implies also using some temporary mounts
(please see https://lkml.org/lkml/2021/3/23/569 for more info)

- mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
   mount order and temporary children-"lock" mounts one can create mount
   trees which can't be restored without those tricks
(see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.



Thanks for picking this up. Overall it looks good for me. Here is one
comment inline.


Cc: Eric W. Biederman 
Cc: Alexander Viro 
Cc: Andrei Vagin 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: lkml 
Signed-off-by: Pavel Tikhomirov 

---
This is a rework of "mnt: allow to add a mount into an existing group"
patch from Andrei. https://lkml.org/lkml/2017/4/28/20

New do_set_group is similar to do_move_mount, but with many restrictions
of do_move_mount removed and that's why:

1) Allow "cross-namespace" sharing group set. If we allow operation only
with mounts from current+anon mount namespace one would still be able to
setns(from_mntns) + open_tree(from, OPEN_TREE_CLONE) + setns(to_mntns) +
move_mount(anon, to, MOVE_MOUNT_SET_GROUP) to set sharing group to mount
in different mount namespace with source mount. But with this approach
we would need to create anon mount namespace and mount copy each time,
which is just a waste of resources. So instead lets just check if we are
allowed to modify both mount namespaces (which looks equivalent to what
setns-es and open_tree check).

2) Allow operating on non-root dentry of the mount. As if we prohibit it
this would require extra care from CRIU side in places where we wan't to
copy sharing group from mount on host (for external mounts) and user
gives us path to non-root dentry. I don't see any problem with
referencing mount with any dentry for sharing group setting. Also there
is no problem with referencing one by file and one by directory.

3) Also checks wich only apply to actually moving mount which we have in
do_move_mount and open_tree are skipped. We don't need to check
MNT_LOCKED, unbindable, nsfs loops and ancestor relation as we don't
move mounts.

Security note: there would be no (new) loops in sharing groups tree,
because this new move_mount(MOVE_MOUNT_SET_GROUP) operation only adds
one _private_ mount to one group (without moving between groups), the
sharing groups tree itself stays unchanged after it.

In Virtuozzo we have "mount-v2" implementation, based with the original
kernel patch from Andrei, tested for almost a year and it actually
decreased number of bugs with mounts a lot. One can take a look on the
implementation of sharing group restore in CRIU in "mount-v2" here:

https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#898

This works almost the same with current version of patch if we replace
mount(MS_SET_GROUP) to move_mount(MOVE_MOUNT_SET_GROUP), please see
super-draft port for mainstream criu, this at least passes
non-user-namespaced mount tests (zdtm.py --mounts-v2 -f ns).

https://github.com/Snorch/criu/commits/mount-v2-poc

---
  fs/namespace.c | 57 +-
  include/uapi/linux/mount.h |  3 +-
  2 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d33909d0f9e..ab439d8510dd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2660,6 +2660,58 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
return ret;
  }
  
+static int do_set_group(struct path *from_path, struct path *to_path)

+{
+   struct mount *from, *to;
+   int err;
+
+   from = real_mount(from_path->mnt);
+   to = real_mount(to_path->mnt);
+
+   namespace_lock();
+
+   err = -EINVAL;
+   /* To and From must be 

[PATCH] move_mount: allow to add a mount into an existing group

2021-03-25 Thread Pavel Tikhomirov
Previously a sharing group (shared and master ids pair) can be only
inherited when mount is created via bindmount. This patch adds an
ability to add an existing private mount into an existing sharing group.

With this functionality one can first create the desired mount tree from
only private mounts (without the need to care about undesired mount
propagation or mount creation order implied by sharing group
dependencies), and next then setup any desired mount sharing between
those mounts in tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- reverse sharing groups vs mount tree order requires complex mounts
  reordering which mostly implies also using some temporary mounts
(please see https://lkml.org/lkml/2021/3/23/569 for more info)

- mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
  mount order and temporary children-"lock" mounts one can create mount
  trees which can't be restored without those tricks
(see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.

Cc: Eric W. Biederman 
Cc: Alexander Viro 
Cc: Andrei Vagin 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: lkml 
Signed-off-by: Pavel Tikhomirov 

---
This is a rework of "mnt: allow to add a mount into an existing group"
patch from Andrei. https://lkml.org/lkml/2017/4/28/20

New do_set_group is similar to do_move_mount, but with many restrictions
of do_move_mount removed and that's why:

1) Allow "cross-namespace" sharing group set. If we allow operation only
with mounts from current+anon mount namespace one would still be able to
setns(from_mntns) + open_tree(from, OPEN_TREE_CLONE) + setns(to_mntns) +
move_mount(anon, to, MOVE_MOUNT_SET_GROUP) to set sharing group to mount
in different mount namespace with source mount. But with this approach
we would need to create anon mount namespace and mount copy each time,
which is just a waste of resources. So instead lets just check if we are
allowed to modify both mount namespaces (which looks equivalent to what
setns-es and open_tree check).

2) Allow operating on non-root dentry of the mount. As if we prohibit it
this would require extra care from CRIU side in places where we wan't to
copy sharing group from mount on host (for external mounts) and user
gives us path to non-root dentry. I don't see any problem with
referencing mount with any dentry for sharing group setting. Also there
is no problem with referencing one by file and one by directory.

3) Also checks wich only apply to actually moving mount which we have in
do_move_mount and open_tree are skipped. We don't need to check
MNT_LOCKED, unbindable, nsfs loops and ancestor relation as we don't
move mounts.

Security note: there would be no (new) loops in sharing groups tree,
because this new move_mount(MOVE_MOUNT_SET_GROUP) operation only adds
one _private_ mount to one group (without moving between groups), the
sharing groups tree itself stays unchanged after it.

In Virtuozzo we have "mount-v2" implementation, based with the original
kernel patch from Andrei, tested for almost a year and it actually
decreased number of bugs with mounts a lot. One can take a look on the
implementation of sharing group restore in CRIU in "mount-v2" here:

https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#898

This works almost the same with current version of patch if we replace
mount(MS_SET_GROUP) to move_mount(MOVE_MOUNT_SET_GROUP), please see
super-draft port for mainstream criu, this at least passes
non-user-namespaced mount tests (zdtm.py --mounts-v2 -f ns).

https://github.com/Snorch/criu/commits/mount-v2-poc

---
 fs/namespace.c | 57 +-
 include/uapi/linux/mount.h |  3 +-
 2 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d33909d0f9e..ab439d8510dd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2660,6 +2660,58 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
return ret;
 }
 
+static int do_set_group(struct path *from_path, struct path *to_path)
+{
+   struct mount *from, *to;
+   int err;
+
+   from = real_mount(from_path->mnt);
+   to = real_mount(to_path->mnt);
+
+   namespace_lock();
+
+   err = -EINVAL;
+   /* To and From must be mounted */
+   if (!is_mounted(&from->mnt))
+   goto out;
+   if (!is_mounted(&to->mnt))
+   goto out;
+
+   err = -EPERM;
+   /* We should be allowed to modif

Re: [CRIU] [PATCH] mnt: allow to add a mount into an existing group

2021-03-24 Thread Pavel Tikhomirov

Adding Andrew to CC with the right email.

On 3/23/21 3:59 PM, Pavel Tikhomirov wrote:

Hi! Can we restart the discussion on this topic?

In CRIU we need to be able to dump/restore all mount trees of system 
container (CT). CT can have anything inside - users which create their 
custom mounts configuration, systemd with custom mount namespaces for 
it's services, nested application containers inside the CT with their 
own mount namespaces, and all mounts in CT mount trees can be grouped by 
sharing groupes (e.g. same shared_id + master_id pair), and those groups 
can depend one from another forming a tree structure of sharing groups.


1) Imagine that we have this sharing group tree (in format (shared_id, 
master_id), 0 means no sharing, we don't care about actual mounts for 
now only master-slave dependencies between sharing groups):


(1,0)
   |- (2,1)
   |- (3,1)
    |- (4,3)
     |- (0,4)

The main problem of restoring mounts is the fact that sharing groups 
currently can be only inherited, e.g. if you have one mount (first) with 
shared_id = x, master_id = y, the only way to get another mount with 
(x,y) is to create a bindmount from the first mount. Also to create 
mount (y,z) from mount (x,y) one should also first inherit (x,y) via 
bindmount and than change to (y,z).


This means that mentioned above tree puts restriction on the mounts 
creation order, one need to have at least one mount for each of sharing 
groups (1,0), (3,1) and (4,3) before creating the first mount of the 
sharing group (0,4).


But what if we want to mount (restore) actual mounts in this mount tree 
"reverse" order:


mntid    parent    mountpoint    (shared_id, master_id)
101    0    /tmp    (0,4)
102    101    /tmp    (4,3)
103    102    /tmp    (3,1)
104    103    /tmp    (1,0)

Mount 104's sharing group should be created before mount 101, 102 and 
103 sharing groups, but mount 104 should be created after those mounts. 
One can actually prepare this setup (on mainstream kernel) by 
pre-creating sharing groups elsewhere and then binding to /tmp in proper 
order with careful unmounting of propagations (see test.sh attached):


[root@snorch propagation-tests]# bash ../test.sh

960 1120 0:56 / /tmp/propagation-tests/tmp rw,relatime master:452 - 
tmpfs propagation-tests-src rw,inode64
958 960 0:56 / /tmp/propagation-tests/tmp/sub rw,relatime shared:452 
master:451 - tmpfs propagation-tests-src rw,inode64
961 958 0:56 / /tmp/propagation-tests/tmp/sub/sub rw,relatime shared:451 
master:433 - tmpfs propagation-tests-src rw,inode64
963 961 0:56 / /tmp/propagation-tests/tmp/sub/sub/sub rw,relatime 
shared:433 - tmpfs propagation-tests-src rw,inode64



But this "pre-creating" from test.sh is not universal at all and only 
works for this simple case. CRIU does not know anything about the 
history of mount creation for system container, it also does not know 
anything about any temporary mounts which were used and then removed. So 
understanding the proper order is almost impossible like Andrew says.


I've also prepared a presentation on Linux Plumbers last year about how 
much problems propagation brings to mounts restore in CRIU, you can take 
a look here https://www.linuxplumbersconf.org/event/7/contributions/640/


2) Propagation creates tons of mounts
3) Mount reparenting
4) "Mount trap"
5) "Non-uniform" propagation
6) “Cross-namespace” sharing groups

Allowing to create mounts private first and create sharing groups later 
and copy sharing groups later instead of inheriting them resolves all 
the problems with propagation at once.


One can take a look on the implementation of sharing group restore in 
CRIU if we have this (mnt: allow to add a mount into an existing group) 
patch applied: 
https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L898 



Obviously this does not solve all the problems with mounts I know about 
but it's a big step forward in properly supporting them in CRIU. We 
already have this tested in Virtuozzo for almost a year and it works nice.


Notes:

- There is another idea, but I should say early that I don't like it, 
because with it restoring mounts with criu would be still super complex. 
We can add extra flag to mount/move_mount syscall to disable propagation 
temporary so that CRIU can restore the mount tree without problems 2-5, 
also we can now create cross-namespace bindmounts with 
(copy_tree+move_mount) to solve 6. But this solution does not help much 
with problem 1 - ordering and the need of temporary mounts. As you can 
see in test.sh you would still need to think hard to solve different 
similar configurations of reverse order between mounts and sharing groups.


- We can actually prohibit cross-namespace MS_SET_GROUP if you like. (If 
both namespaces are non abstract.) We can use open_tree to create a copy 
of the mount with 

Re: [CRIU] [PATCH] mnt: allow to add a mount into an existing group

2021-03-23 Thread Pavel Tikhomirov

Hi! Can we restart the discussion on this topic?

In CRIU we need to be able to dump/restore all mount trees of system 
container (CT). CT can have anything inside - users which create their 
custom mounts configuration, systemd with custom mount namespaces for 
it's services, nested application containers inside the CT with their 
own mount namespaces, and all mounts in CT mount trees can be grouped by 
sharing groupes (e.g. same shared_id + master_id pair), and those groups 
can depend one from another forming a tree structure of sharing groups.


1) Imagine that we have this sharing group tree (in format (shared_id, 
master_id), 0 means no sharing, we don't care about actual mounts for 
now only master-slave dependencies between sharing groups):


(1,0)
  |- (2,1)
  |- (3,1)
   |- (4,3)
|- (0,4)

The main problem of restoring mounts is the fact that sharing groups 
currently can be only inherited, e.g. if you have one mount (first) with 
shared_id = x, master_id = y, the only way to get another mount with 
(x,y) is to create a bindmount from the first mount. Also to create 
mount (y,z) from mount (x,y) one should also first inherit (x,y) via 
bindmount and than change to (y,z).


This means that mentioned above tree puts restriction on the mounts 
creation order, one need to have at least one mount for each of sharing 
groups (1,0), (3,1) and (4,3) before creating the first mount of the 
sharing group (0,4).


But what if we want to mount (restore) actual mounts in this mount tree 
"reverse" order:


mntid   parent  mountpoint  (shared_id, master_id)
101 0   /tmp(0,4)
102 101 /tmp(4,3)
103 102 /tmp(3,1)
104 103 /tmp(1,0)

Mount 104's sharing group should be created before mount 101, 102 and 
103 sharing groups, but mount 104 should be created after those mounts. 
One can actually prepare this setup (on mainstream kernel) by 
pre-creating sharing groups elsewhere and then binding to /tmp in proper 
order with careful unmounting of propagations (see test.sh attached):


[root@snorch propagation-tests]# bash ../test.sh

960 1120 0:56 / /tmp/propagation-tests/tmp rw,relatime master:452 - 
tmpfs propagation-tests-src rw,inode64
958 960 0:56 / /tmp/propagation-tests/tmp/sub rw,relatime shared:452 
master:451 - tmpfs propagation-tests-src rw,inode64
961 958 0:56 / /tmp/propagation-tests/tmp/sub/sub rw,relatime shared:451 
master:433 - tmpfs propagation-tests-src rw,inode64
963 961 0:56 / /tmp/propagation-tests/tmp/sub/sub/sub rw,relatime 
shared:433 - tmpfs propagation-tests-src rw,inode64



But this "pre-creating" from test.sh is not universal at all and only 
works for this simple case. CRIU does not know anything about the 
history of mount creation for system container, it also does not know 
anything about any temporary mounts which were used and then removed. So 
understanding the proper order is almost impossible like Andrew says.


I've also prepared a presentation on Linux Plumbers last year about how 
much problems propagation brings to mounts restore in CRIU, you can take 
a look here https://www.linuxplumbersconf.org/event/7/contributions/640/


2) Propagation creates tons of mounts
3) Mount reparenting
4) "Mount trap"
5) "Non-uniform" propagation
6) “Cross-namespace” sharing groups

Allowing to create mounts private first and create sharing groups later 
and copy sharing groups later instead of inheriting them resolves all 
the problems with propagation at once.


One can take a look on the implementation of sharing group restore in 
CRIU if we have this (mnt: allow to add a mount into an existing group) 
patch applied: 
https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L898


Obviously this does not solve all the problems with mounts I know about 
but it's a big step forward in properly supporting them in CRIU. We 
already have this tested in Virtuozzo for almost a year and it works nice.


Notes:

- There is another idea, but I should say early that I don't like it, 
because with it restoring mounts with criu would be still super complex. 
We can add extra flag to mount/move_mount syscall to disable propagation 
temporary so that CRIU can restore the mount tree without problems 2-5, 
also we can now create cross-namespace bindmounts with 
(copy_tree+move_mount) to solve 6. But this solution does not help much 
with problem 1 - ordering and the need of temporary mounts. As you can 
see in test.sh you would still need to think hard to solve different 
similar configurations of reverse order between mounts and sharing groups.


- We can actually prohibit cross-namespace MS_SET_GROUP if you like. (If 
both namespaces are non abstract.) We can use open_tree to create a copy 
of the mount with the same sharing group and only then copy sharing from 
the copy while being in proper mountns.


- We still need it:

> this code might be made

Re: [PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task

2021-02-08 Thread Pavel Tikhomirov




On 2/8/21 3:31 PM, Jeff Layton wrote:

On Thu, 2021-02-04 at 01:17 +0300, Cyrill Gorcunov wrote:

On Thu, Feb 04, 2021 at 12:35:42AM +0300, Pavel Tikhomirov wrote:


AFAICS if pid is held only by 1) fowner refcount and by 2) single process
(without threads, group and session for simplicity), on process exit we go
through:

do_exit
   exit_notify
 release_task
   __exit_signal
 __unhash_process
   detach_pid
 __change_pid
   free_pid
 idr_remove

So pid is removed from idr, and after that alloc_pid can reuse pid numbers
even if old pid structure is still alive and is still held by fowner.

...

Hope this answers your question, Thanks!


Yeah, indeed, thanks! So the change is sane still I'm
a bit worried about backward compatibility, gimme some
time I'll try to refresh my memory first, in a couple
of days or weekend (though here are a number of experienced
developers CC'ed maybe they reply even faster).


I always find it helpful to refer to the POSIX spec [1] for this sort of
thing. In this case, it says:

F_GETOWN
 If fildes refers to a socket, get the process ID or process group ID
specified to receive SIGURG signals when out-of-band data is available.
Positive values shall indicate a process ID; negative values, other than
-1, shall indicate a process group ID; the value zero shall indicate
that no SIGURG signals are to be sent. If fildes does not refer to a
socket, the results are unspecified.

In the event that the PID is reused, the kernel won't send signals to
the replacement task, correct?


Correct. Looks like only places to send signal to owner are send_sigio() 
and send_sigurg() (at least nobody else dereferences fown->pid_type). 
And in both places we lookup for task to send signal to with pid_task() 
or do_each_pid_task() (similar to what I do in patch) and will not find 
any task if pid was reused. Thus no signal would be sent.



Assuming that's the case, then this patch
looks fine to me too. I'll plan to pick it for linux-next later today,
and we can hopefully get this into v5.12.

[1]: 
https://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html#tag_16_122



Thanks for finding it!

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task

2021-02-03 Thread Pavel Tikhomirov

On 2/3/21 10:32 PM, Cyrill Gorcunov wrote:

On Wed, Feb 03, 2021 at 03:41:56PM +0300, Pavel Tikhomirov wrote:

Currently there is no way to differentiate the file with alive owner
from the file with dead owner but pid of the owner reused. That's why
CRIU can't actually know if it needs to restore file owner or not,
because if it restores owner but actual owner was dead, this can
introduce unexpected signals to the "false"-owner (which reused the
pid).


Hi! Thanks for the patch. You know I manage to forget the fowner internals.
Could you please enlighten me -- when owner is set with some pid we do

f_setown_ex
   __f_setown
 f_modown
   filp->f_owner.pid = get_pid(pid);

Thus pid get refcount incremented.


Hi, and yes you are right about refcount is held.

 Then the owner exits but refcounter

should be still up and running and pid should not be reused, no? Or
I miss something obvious?


AFAICS if pid is held only by 1) fowner refcount and by 2) single 
process (without threads, group and session for simplicity), on process 
exit we go through:


do_exit
  exit_notify
release_task
  __exit_signal
__unhash_process
  detach_pid
__change_pid
  free_pid
idr_remove

So pid is removed from idr, and after that alloc_pid can reuse pid 
numbers even if old pid structure is still alive and is still held by 
fowner.


Also I've added criu-zdtm test which reproduces the problem:

https://src.openvz.org/projects/OVZ/repos/criu/commits/e25904c35dbc535f6837e55da58ca0f5a5caf4b3#test/zdtm/static/file_fown_reuse.c

Hope this answers your question, Thanks!



The patch itself looks ok on a first glance.



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task

2021-02-03 Thread Pavel Tikhomirov
Currently there is no way to differentiate the file with alive owner
from the file with dead owner but pid of the owner reused. That's why
CRIU can't actually know if it needs to restore file owner or not,
because if it restores owner but actual owner was dead, this can
introduce unexpected signals to the "false"-owner (which reused the
pid).

Let's change the api, so that F_GETOWN(EX) returns 0 in case actual
owner is dead already.

Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Cyrill Gorcunov 
Cc: Andrei Vagin 
Signed-off-by: Pavel Tikhomirov 
---
 fs/fcntl.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 05b36b28f2e8..483ef8861376 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -148,11 +148,15 @@ void f_delown(struct file *filp)
 
 pid_t f_getown(struct file *filp)
 {
-   pid_t pid;
+   pid_t pid = 0;
read_lock(&filp->f_owner.lock);
-   pid = pid_vnr(filp->f_owner.pid);
-   if (filp->f_owner.pid_type == PIDTYPE_PGID)
-   pid = -pid;
+   rcu_read_lock();
+   if (pid_task(filp->f_owner.pid, filp->f_owner.pid_type)) {
+   pid = pid_vnr(filp->f_owner.pid);
+   if (filp->f_owner.pid_type == PIDTYPE_PGID)
+   pid = -pid;
+   }
+   rcu_read_unlock();
read_unlock(&filp->f_owner.lock);
return pid;
 }
@@ -200,11 +204,14 @@ static int f_setown_ex(struct file *filp, unsigned long 
arg)
 static int f_getown_ex(struct file *filp, unsigned long arg)
 {
struct f_owner_ex __user *owner_p = (void __user *)arg;
-   struct f_owner_ex owner;
+   struct f_owner_ex owner = {};
int ret = 0;
 
read_lock(&filp->f_owner.lock);
-   owner.pid = pid_vnr(filp->f_owner.pid);
+   rcu_read_lock();
+   if (pid_task(filp->f_owner.pid, filp->f_owner.pid_type))
+   owner.pid = pid_vnr(filp->f_owner.pid);
+   rcu_read_unlock();
switch (filp->f_owner.pid_type) {
case PIDTYPE_PID:
owner.type = F_OWNER_TID;
-- 
2.26.2



[PATCH v5 2/2] ovl: introduce new "uuid=off" option for inodes index feature

2020-10-13 Thread Pavel Tikhomirov
This replaces uuid with null in overlayfs file handles and thus relaxes
uuid checks for overlay index feature. It is only possible in case there
is only one filesystem for all the work/upper/lower directories and bare
file handles from this backing filesystem are unique. In other case when
we have multiple filesystems lets just fallback to "uuid=on" which is
and equivalent of how it worked before with all uuid checks.

This is needed when overlayfs is/was mounted in a container with index
enabled (e.g.: to be able to resolve inotify watch file handles on it to
paths in CRIU), and this container is copied and started alongside with
the original one. This way the "copy" container can't have the same uuid
on the superblock and mounting the overlayfs from it later would fail.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
  #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
  #mount: /loop-test/loop-mp/merged:
  #mount(2) system call failed: Stale file handle.

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
create the copy of container and we require fs uuid to be unique for a new
container.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 

---
v2: in v1 I missed actual uuid check skip
v3: rebase to overlayfs-next, replace uuid with null in file handles,
split ovl_fs propagation to function arguments to separate patch, add
separate bool "uuid=on/off" option, move numfs check up, add doc note.
v4: get rid of double negatives, remove nouuid leftower comment, fix
misprint in kernel config name
v5: fix typos; remove config option, module param, ovl_uuid_def and the
corresponding notes

Signed-off-by: Pavel Tikhomirov 
---
 Documentation/filesystems/overlayfs.rst |  5 +
 fs/overlayfs/copy_up.c  |  3 ++-
 fs/overlayfs/namei.c|  4 +++-
 fs/overlayfs/ovl_entry.h|  1 +
 fs/overlayfs/super.c| 20 
 5 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst 
b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..a3e588dc8437 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -563,6 +563,11 @@ This verification may cause significant overhead in some 
cases.
 Note: the mount options index=off,nfs_export=on are conflicting for a
 read-write mount and will result in an error.
 
+Note: the mount option uuid=off can be used to replace UUID of the underlying
+filesystem in file handles with null, and effectively disable UUID checks. This
+can be useful in case the underlying disk is copied and the UUID of this copy
+is changed. This is only applicable if all lower/upper/work directories are on
+the same filesystem, otherwise it will fallback to normal behaviour.
 
 Volatile mount
 --
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 3380039036d6..0b7e7a90a435 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -320,7 +320,8 @@ struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, 
struct dentry *real,
if (is_upper)
fh->fb.flags |= OVL_FH_FLAG_PATH_UPPER;
fh->fb.len = sizeof(fh->fb) + buflen;
-   fh->fb.uuid = *uuid;
+   if (ofs->config.uuid)
+   fh->fb.uuid = *uuid;
 
return fh;
 
diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index f058bf8e8b87..f731eb4d35f9 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -159,8 +159,10 @@ struct dentry *ovl_decode_real_fh(struct ovl_fs *ofs, 
struct ovl_fh *fh,
/*
 * Make sure that the stored uuid matches the uuid of the lower
 * layer where file handle will be decoded.
+* In case of uuid=off option just make sure that stored uuid is null.
 */
-   if (!uuid_equal(&fh->fb.uuid, &mnt->mnt_sb->s_uuid))
+   if (ofs->config.uuid ? !uuid_equal(&fh->fb.uuid, &mnt->mnt_sb->s_uuid) :
+ !uuid_is_null(&fh->fb.uuid))
return NULL;
 
bytes = (fh->fb.len - offsetof(struct ovl_fb, fid));
diff --git a/fs/overlay

[PATCH v5 1/2] ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh

2020-10-13 Thread Pavel Tikhomirov
This will be used in next patch to be able to change uuid checks and
add uuid nullification based on ofs->config.index for a new "uuid=off"
mode.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Reviewed-by: Amir Goldstein 
Signed-off-by: Pavel Tikhomirov 
---
 fs/overlayfs/copy_up.c   | 22 --
 fs/overlayfs/export.c| 10 ++
 fs/overlayfs/namei.c | 19 ++-
 fs/overlayfs/overlayfs.h | 14 --
 fs/overlayfs/util.c  |  3 ++-
 5 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 955ecd4030f0..3380039036d6 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -275,7 +275,8 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat 
*stat)
return err;
 }
 
-struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper)
+struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, struct dentry *real,
+ bool is_upper)
 {
struct ovl_fh *fh;
int fh_type, dwords;
@@ -328,8 +329,8 @@ struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool 
is_upper)
return ERR_PTR(err);
 }
 
-int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
-  struct dentry *upper)
+int ovl_set_origin(struct ovl_fs *ofs, struct dentry *dentry,
+  struct dentry *lower, struct dentry *upper)
 {
const struct ovl_fh *fh = NULL;
int err;
@@ -340,7 +341,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry 
*lower,
 * up and a pure upper inode.
 */
if (ovl_can_decode_fh(lower->d_sb)) {
-   fh = ovl_encode_real_fh(lower, false);
+   fh = ovl_encode_real_fh(ofs, lower, false);
if (IS_ERR(fh))
return PTR_ERR(fh);
}
@@ -362,7 +363,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct 
dentry *upper,
const struct ovl_fh *fh;
int err;
 
-   fh = ovl_encode_real_fh(upper, true);
+   fh = ovl_encode_real_fh(ofs, upper, true);
if (IS_ERR(fh))
return PTR_ERR(fh);
 
@@ -380,6 +381,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct 
dentry *upper,
 static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
struct dentry *upper)
 {
+   struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
struct dentry *indexdir = ovl_indexdir(dentry->d_sb);
struct inode *dir = d_inode(indexdir);
struct dentry *index = NULL;
@@ -402,7 +404,7 @@ static int ovl_create_index(struct dentry *dentry, struct 
dentry *origin,
if (WARN_ON(ovl_test_flag(OVL_INDEX, d_inode(dentry
return -EIO;
 
-   err = ovl_get_index_name(origin, &name);
+   err = ovl_get_index_name(ofs, origin, &name);
if (err)
return err;
 
@@ -411,7 +413,7 @@ static int ovl_create_index(struct dentry *dentry, struct 
dentry *origin,
if (IS_ERR(temp))
goto free_name;
 
-   err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp);
+   err = ovl_set_upper_fh(ofs, upper, temp);
if (err)
goto out;
 
@@ -521,7 +523,7 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, 
struct dentry *temp)
 * hard link.
 */
if (c->origin) {
-   err = ovl_set_origin(c->dentry, c->lowerpath.dentry, temp);
+   err = ovl_set_origin(ofs, c->dentry, c->lowerpath.dentry, temp);
if (err)
return err;
}
@@ -700,7 +702,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
 static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 {
int err;
-   struct ovl_fs *ofs = c->dentry->d_sb->s_fs_info;
+   struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
bool to_index = false;
 
/*
@@ -722,7 +724,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 
if (to_index) {
c->destdir = ovl_indexdir(c->dentry->d_sb);
-   err = ovl_get_index_name(c->lowerpath.dentry, &c->destname);
+   err = ovl_get_index_name(ofs, c->lowerpath.dentry, 
&c->destname);
if (err)
return err;
} else if (WARN_ON(!c->parent)) {
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index ed35be3fafc6..41ebf52f1bbc 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -211,7 +211,8 @@ static int ovl_check_encode_origin(struct dentry *dentry)
return 1;
 }
 
-static int ovl_dentry_to_fid(struct dentry *dentry, u32 *fid, int buflen)
+static int ovl_dentry_to_fid(struct ovl_fs *ofs, struct dentry *dentry,
+u32 *fid, int buflen)

[PATCH v5 0/2] ovl introduce "uuid=off"

2020-10-13 Thread Pavel Tikhomirov
This is a v5 of:
ovl: introduce new "index=nouuid" option for inodes index feature

Changes in v3: rebase to overlayfs-next, replace uuid with null in file
handles, propagate ovl_fs to needed functions in a separate patch, add
separate bool "uuid=on/off" option, fix numfs check fallback, add a note
to docs.

Changes in v4: get rid of double negatives, remove nouuid leftower
comment, fix missprint in kernel config name.

Changes in v5: fix typos; remove config option and module param.

Amir, as second patch had changed quiet a bit, I don't add you
reviewed-by to it.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 

Pavel Tikhomirov (2):
  ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh
  ovl: introduce new "uuid=off" option for inodes index feature

 Documentation/filesystems/overlayfs.rst |  5 +
 fs/overlayfs/copy_up.c  | 25 ++---
 fs/overlayfs/export.c   | 10 ++
 fs/overlayfs/namei.c| 23 +--
 fs/overlayfs/overlayfs.h| 14 --
 fs/overlayfs/ovl_entry.h|  1 +
 fs/overlayfs/super.c| 20 
 fs/overlayfs/util.c |  3 ++-
 8 files changed, 69 insertions(+), 32 deletions(-)

-- 
2.26.2



Re: [PATCH v4 2/2] ovl: introduce new "uuid=off" option for inodes index feature

2020-10-13 Thread Pavel Tikhomirov




On 10/6/20 6:13 PM, Miklos Szeredi wrote:

On Fri, Sep 25, 2020 at 10:35 AM Pavel Tikhomirov
 wrote:


Note: In our (Virtuozzo) use case users inside a container can create
"regular" overlayfs mounts without any "index=" option, but we still
want to migrate this containers with CRIU so we set "index=on" as kernel
default so that all the container overlayfs mounts get support of file
handles automatically. With "uuid=off" we want the same thing (to be
able to "copy" container with uuid change) - we would set kernel default
so that all the container overlayfs mounts get "uuid=off" automatically.


I'm not sure I buy that argument for a kernel option.   It should
rather be a "container" option in that case, but AFAIK the kernel
doesn't have a concept of a container.  I think this needs to be
discussed on the relevant mailing lists.

As of now mainline kernel doesn't support unprivileged overlay mounts,
so I guess this is not an issue.  Let's just merge this without the
kernel and the module options.


Virtuozzo kernel does have a "container" concept and we do have 
unprivileged overlay mounts to support docker inside Virtuozzo 
containers. We don't face any major issues with it. But you are right 
it's not mainstream.


Probably a normal user of mainstream kernel also might want to set 
index=on+uuid=off by default, so that all their docker containters 
automatically support inotifies and survive backing disk uuid change 
automaticaly.


I will prepare next patchset version without default.



Thanks,
Miklos



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH 0/4] fs: add mount_setattr()

2020-10-01 Thread Pavel Tikhomirov





mount_setattr() can be expected to grow over time and is designed with
extensibility in mind. It follows the extensible syscall pattern we have
used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
__u64 attr_set;
__u64 attr_clr;
__u32 propagation;
__u32 atime;
};



We probably can rework "mnt: allow to add a mount into an existing 
group" (MS_SET_GROUP https://lkml.org/lkml/2017/1/23/712) to an 
extension mount_setattr. Do anyone have any objections?


We need it in CRIU because it is a big problem to restore complex mount 
trees with complex propagation flags (see my LPC talk for details 
https://linuxplumbersconf.org/event/7/contributions/640/) of 
system-containers.


If we allow set(copy) sharing options we can separate mount tree restore 
and propagation restore and everything becomes much simpler. (And we 
already have CRIU implementation based on it, which helps with variety 
of bugs with mounts we previously had. 
https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#880)


I've also tried to consider another approach 
https://github.com/Snorch/linux/commit/84886f588527b062993ec3e9760c879163852518 
to disable actual propagation while restoring the mount tree. But with 
this approach it looks like it would be still hard to restore in CRIU 
because: With "MS_SET_GROUP" we don't care about roots. Imagine we have 
mount A (shared_id=1, root="/some/sub/path") and mount B (master_id=1, 
root="/some/other/sub/path", with MS_SET_GROUP we can copy sharing from 
mount A to B and make B MS_SLAVE to restore them. In case we want to do 
the same with only inheritance we would have to have some helper mount 
in this share C (shared_id=1, root="/") so that B can inherit this share...


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH v4 2/2] ovl: introduce new "uuid=off" option for inodes index feature

2020-09-28 Thread Pavel Tikhomirov

On 9/25/20 7:42 PM, Amir Goldstein wrote:

Apart from some typos, looks good to me.


Amir, Thanks a lot for your review!

> you should wait for more feedback from others

Sure, will wait.

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH v3 2/2] ovl: introduce new "uuid=off" option for inodes index feature

2020-09-25 Thread Pavel Tikhomirov
Thanks for review! I've sent v4 to address all your comments. I've also 
noticed that I had inconsistent config option naming:


+config OVERLAY_FS_INDEX_UUID_OFF
+static bool ovl_uuid_off_def = IS_ENABLED(CONFIG_OVERLAY_FS_UUID_OFF);

this is also fixed.

On 9/24/20 8:02 PM, Amir Goldstein wrote:

On Thu, Sep 24, 2020 at 7:38 PM Pavel Tikhomirov
 wrote:


This replaces uuid with null in overelayfs file handles and thus relaxes
uuid checks for overlay index feature. It is only possible in case there
is only one filesystem for all the work/upper/lower directories and bare
file handles from this backing filesystem are uniq. In other case when
we have multiple filesystems lets just fallback to "uuid=on" which is
and equivalent of how it worked before with all uuid checks.

This is needed when overlayfs is/was mounted in a container with index
enabled (e.g.: to be able to resolve inotify watch file handles on it to
paths in CRIU), and this container is copied and started alongside with
the original one. This way the "copy" container can't have the same uuid
on the superblock and mounting the overlayfs from it later would fail.

Note: In our (Virtuozzo) use case users inside a container can create
"regular" overlayfs mounts without any "index=" option, but we still
want to migrate this containers with CRIU so we set "index=on" as kernel
default so that all the container overlayfs mounts get support of file
handles automatically. With "uuid=off" we want the same thing (to be
able to "copy" container with uuid change) - we would set kernel default
so that all the container overlayfs mounts get "uuid=off" automatically.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
   #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
   #mount: /loop-test/loop-mp/merged:
   #mount(2) system call failed: Stale file handle.

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
crate the copy of container and we require fs uuid to be uniq for a new
container.

v2: in v1 I missed actual uuid check skip
v3: rebase to overlayfs-next, replace uuid with null in file handles,
split ovl_fs propagation to function arguments to separate patch, add
separate bool "uuid=on/off" option, move numfs check up, add doc note.


change log does not belong in commit message.
Move after --- please.



CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 
---
  Documentation/filesystems/overlayfs.rst |  6 ++
  fs/overlayfs/Kconfig| 17 +
  fs/overlayfs/copy_up.c  |  3 ++-
  fs/overlayfs/namei.c|  5 -
  fs/overlayfs/ovl_entry.h|  1 +
  fs/overlayfs/super.c| 25 +
  6 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst 
b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..4f9cc20f255c 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -563,6 +563,12 @@ This verification may cause significant overhead in some 
cases.
  Note: the mount options index=off,nfs_export=on are conflicting for a
  read-write mount and will result in an error.

+Note: the mount option uuid=off (or corresponding module param, or kernel
+config) can be used to replace UUID of the underlying filesystem in file
+handles with null, and effectively disable UUID checks. This can be useful in
+case the underlying disk is copied and the UUID of this copy is changed. This
+is only applicable if all lower/upper/work directories are on the same
+filesystem, otherwise it will fallback to normal behaviour.

  Volatile mount
  --
diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
index dd188c7996b3..888c6e5e71ee 100644
--- a/fs/overlayfs/Kconfig
+++ b/fs/overlayfs/Kconfig
@@ -61,6 +61,23 @@ config OVERLAY_FS_INDEX

   If unsure, say N.

+config OVERLAY_FS_INDEX_UUID_OFF


Please get rid of all the double negatives.
config/param uuid_off should be uuid and default to Y.
Otherwise this looks fine to me.


+   bool "Overlayfs: export null uuid in file handles&qu

[PATCH v4 2/2] ovl: introduce new "uuid=off" option for inodes index feature

2020-09-25 Thread Pavel Tikhomirov
This replaces uuid with null in overelayfs file handles and thus relaxes
uuid checks for overlay index feature. It is only possible in case there
is only one filesystem for all the work/upper/lower directories and bare
file handles from this backing filesystem are uniq. In other case when
we have multiple filesystems lets just fallback to "uuid=on" which is
and equivalent of how it worked before with all uuid checks.

This is needed when overlayfs is/was mounted in a container with index
enabled (e.g.: to be able to resolve inotify watch file handles on it to
paths in CRIU), and this container is copied and started alongside with
the original one. This way the "copy" container can't have the same uuid
on the superblock and mounting the overlayfs from it later would fail.

Note: In our (Virtuozzo) use case users inside a container can create
"regular" overlayfs mounts without any "index=" option, but we still
want to migrate this containers with CRIU so we set "index=on" as kernel
default so that all the container overlayfs mounts get support of file
handles automatically. With "uuid=off" we want the same thing (to be
able to "copy" container with uuid change) - we would set kernel default
so that all the container overlayfs mounts get "uuid=off" automatically.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
  #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
  #mount: /loop-test/loop-mp/merged:
  #mount(2) system call failed: Stale file handle.

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
crate the copy of container and we require fs uuid to be uniq for a new
container.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 

---
v2: in v1 I missed actual uuid check skip
v3: rebase to overlayfs-next, replace uuid with null in file handles,
split ovl_fs propagation to function arguments to separate patch, add
separate bool "uuid=on/off" option, move numfs check up, add doc note.
v4: get rid of double negatives, remove nouuid leftower comment, fix
missprint in kernel config name

 Documentation/filesystems/overlayfs.rst |  6 ++
 fs/overlayfs/Kconfig| 19 +++
 fs/overlayfs/copy_up.c  |  3 ++-
 fs/overlayfs/namei.c|  4 +++-
 fs/overlayfs/ovl_entry.h|  1 +
 fs/overlayfs/super.c| 25 +
 6 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst 
b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..4f9cc20f255c 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -563,6 +563,12 @@ This verification may cause significant overhead in some 
cases.
 Note: the mount options index=off,nfs_export=on are conflicting for a
 read-write mount and will result in an error.
 
+Note: the mount option uuid=off (or corresponding module param, or kernel
+config) can be used to replace UUID of the underlying filesystem in file
+handles with null, and effectively disable UUID checks. This can be useful in
+case the underlying disk is copied and the UUID of this copy is changed. This
+is only applicable if all lower/upper/work directories are on the same
+filesystem, otherwise it will fallback to normal behaviour.
 
 Volatile mount
 --
diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
index dd188c7996b3..c21abdb43206 100644
--- a/fs/overlayfs/Kconfig
+++ b/fs/overlayfs/Kconfig
@@ -61,6 +61,25 @@ config OVERLAY_FS_INDEX
 
  If unsure, say N.
 
+config OVERLAY_FS_INDEX_UUID
+   bool "Overlayfs: export uuid in file handles"
+   default y
+   depends on OVERLAY_FS
+   help
+ If this config option is disabled then overlay will replace uuid with
+ null in overlayfs file handles, effectively disabling uuid checks for
+ them. This affects overlayfs mounted with "index=on". This only can be
+ done if all upper and lower directories are on the same filesystem
+ where basic fhandles are uniq. In case the latter is not true
+ overlayfs wou

[PATCH v4 1/2] ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh

2020-09-25 Thread Pavel Tikhomirov
This will be used in next patch to be able to change uuid checks and
add uuid nullification based on ofs->config.index for a new "uuid=off"
mode.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 
---
 fs/overlayfs/copy_up.c   | 22 --
 fs/overlayfs/export.c| 10 ++
 fs/overlayfs/namei.c | 19 ++-
 fs/overlayfs/overlayfs.h | 14 --
 fs/overlayfs/util.c  |  3 ++-
 5 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 955ecd4030f0..3380039036d6 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -275,7 +275,8 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat 
*stat)
return err;
 }
 
-struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper)
+struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, struct dentry *real,
+ bool is_upper)
 {
struct ovl_fh *fh;
int fh_type, dwords;
@@ -328,8 +329,8 @@ struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool 
is_upper)
return ERR_PTR(err);
 }
 
-int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
-  struct dentry *upper)
+int ovl_set_origin(struct ovl_fs *ofs, struct dentry *dentry,
+  struct dentry *lower, struct dentry *upper)
 {
const struct ovl_fh *fh = NULL;
int err;
@@ -340,7 +341,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry 
*lower,
 * up and a pure upper inode.
 */
if (ovl_can_decode_fh(lower->d_sb)) {
-   fh = ovl_encode_real_fh(lower, false);
+   fh = ovl_encode_real_fh(ofs, lower, false);
if (IS_ERR(fh))
return PTR_ERR(fh);
}
@@ -362,7 +363,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct 
dentry *upper,
const struct ovl_fh *fh;
int err;
 
-   fh = ovl_encode_real_fh(upper, true);
+   fh = ovl_encode_real_fh(ofs, upper, true);
if (IS_ERR(fh))
return PTR_ERR(fh);
 
@@ -380,6 +381,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct 
dentry *upper,
 static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
struct dentry *upper)
 {
+   struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
struct dentry *indexdir = ovl_indexdir(dentry->d_sb);
struct inode *dir = d_inode(indexdir);
struct dentry *index = NULL;
@@ -402,7 +404,7 @@ static int ovl_create_index(struct dentry *dentry, struct 
dentry *origin,
if (WARN_ON(ovl_test_flag(OVL_INDEX, d_inode(dentry
return -EIO;
 
-   err = ovl_get_index_name(origin, &name);
+   err = ovl_get_index_name(ofs, origin, &name);
if (err)
return err;
 
@@ -411,7 +413,7 @@ static int ovl_create_index(struct dentry *dentry, struct 
dentry *origin,
if (IS_ERR(temp))
goto free_name;
 
-   err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp);
+   err = ovl_set_upper_fh(ofs, upper, temp);
if (err)
goto out;
 
@@ -521,7 +523,7 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, 
struct dentry *temp)
 * hard link.
 */
if (c->origin) {
-   err = ovl_set_origin(c->dentry, c->lowerpath.dentry, temp);
+   err = ovl_set_origin(ofs, c->dentry, c->lowerpath.dentry, temp);
if (err)
return err;
}
@@ -700,7 +702,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
 static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 {
int err;
-   struct ovl_fs *ofs = c->dentry->d_sb->s_fs_info;
+   struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
bool to_index = false;
 
/*
@@ -722,7 +724,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 
if (to_index) {
c->destdir = ovl_indexdir(c->dentry->d_sb);
-   err = ovl_get_index_name(c->lowerpath.dentry, &c->destname);
+   err = ovl_get_index_name(ofs, c->lowerpath.dentry, 
&c->destname);
if (err)
return err;
} else if (WARN_ON(!c->parent)) {
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index ed35be3fafc6..41ebf52f1bbc 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -211,7 +211,8 @@ static int ovl_check_encode_origin(struct dentry *dentry)
return 1;
 }
 
-static int ovl_dentry_to_fid(struct dentry *dentry, u32 *fid, int buflen)
+static int ovl_dentry_to_fid(struct ovl_fs *ofs, struct dentry *dentry,
+u32 *fid, int buflen)
 {
struct ovl_fh *fh = NULL;

[PATCH v4 0/2] ovl introduce "uuid=off"

2020-09-25 Thread Pavel Tikhomirov
This is a v4 of:
ovl: introduce new "index=nouuid" option for inodes index feature

Changes in v3: rebase to overlayfs-next, replace uuid with null in file
handles, propagate ovl_fs to needed functions in a separate patch, add
separate bool "uuid=on/off" option, fix numfs check fallback, add a note
to docs.

Changes in v4: get rid of double negatives, remove nouuid leftower
comment, fix missprint in kernel config name.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 

Pavel Tikhomirov (2):
  ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh
  ovl: introduce new "uuid=off" option for inodes index feature

 Documentation/filesystems/overlayfs.rst |  6 ++
 fs/overlayfs/Kconfig| 19 +++
 fs/overlayfs/copy_up.c  | 25 ++---
 fs/overlayfs/export.c   | 10 ++
 fs/overlayfs/namei.c| 23 +--
 fs/overlayfs/overlayfs.h| 14 --
 fs/overlayfs/ovl_entry.h|  1 +
 fs/overlayfs/super.c| 25 +
 fs/overlayfs/util.c |  3 ++-
 9 files changed, 94 insertions(+), 32 deletions(-)

-- 
2.26.2



[PATCH v3 2/2] ovl: introduce new "uuid=off" option for inodes index feature

2020-09-24 Thread Pavel Tikhomirov
This replaces uuid with null in overelayfs file handles and thus relaxes
uuid checks for overlay index feature. It is only possible in case there
is only one filesystem for all the work/upper/lower directories and bare
file handles from this backing filesystem are uniq. In other case when
we have multiple filesystems lets just fallback to "uuid=on" which is
and equivalent of how it worked before with all uuid checks.

This is needed when overlayfs is/was mounted in a container with index
enabled (e.g.: to be able to resolve inotify watch file handles on it to
paths in CRIU), and this container is copied and started alongside with
the original one. This way the "copy" container can't have the same uuid
on the superblock and mounting the overlayfs from it later would fail.

Note: In our (Virtuozzo) use case users inside a container can create
"regular" overlayfs mounts without any "index=" option, but we still
want to migrate this containers with CRIU so we set "index=on" as kernel
default so that all the container overlayfs mounts get support of file
handles automatically. With "uuid=off" we want the same thing (to be
able to "copy" container with uuid change) - we would set kernel default
so that all the container overlayfs mounts get "uuid=off" automatically.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
  #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
  #mount: /loop-test/loop-mp/merged:
  #mount(2) system call failed: Stale file handle.

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
crate the copy of container and we require fs uuid to be uniq for a new
container.

v2: in v1 I missed actual uuid check skip
v3: rebase to overlayfs-next, replace uuid with null in file handles,
split ovl_fs propagation to function arguments to separate patch, add
separate bool "uuid=on/off" option, move numfs check up, add doc note.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 
---
 Documentation/filesystems/overlayfs.rst |  6 ++
 fs/overlayfs/Kconfig| 17 +
 fs/overlayfs/copy_up.c  |  3 ++-
 fs/overlayfs/namei.c|  5 -
 fs/overlayfs/ovl_entry.h|  1 +
 fs/overlayfs/super.c| 25 +
 6 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst 
b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..4f9cc20f255c 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -563,6 +563,12 @@ This verification may cause significant overhead in some 
cases.
 Note: the mount options index=off,nfs_export=on are conflicting for a
 read-write mount and will result in an error.
 
+Note: the mount option uuid=off (or corresponding module param, or kernel
+config) can be used to replace UUID of the underlying filesystem in file
+handles with null, and effectively disable UUID checks. This can be useful in
+case the underlying disk is copied and the UUID of this copy is changed. This
+is only applicable if all lower/upper/work directories are on the same
+filesystem, otherwise it will fallback to normal behaviour.
 
 Volatile mount
 --
diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
index dd188c7996b3..888c6e5e71ee 100644
--- a/fs/overlayfs/Kconfig
+++ b/fs/overlayfs/Kconfig
@@ -61,6 +61,23 @@ config OVERLAY_FS_INDEX
 
  If unsure, say N.
 
+config OVERLAY_FS_INDEX_UUID_OFF
+   bool "Overlayfs: export null uuid in file handles"
+   depends on OVERLAY_FS
+   help
+ If this config option is enabled then overlay will replace uuid with
+ null in overlayfs file handles, effectively disabling uuid checks for
+ them. This affects overlayfs mounted with "index=on". This only can be
+ done if all upper and lower directories are on the same filesystem
+ where basic fhandles are uniq. In case the latter is not true
+ overlayfs would fallback to normal uuid checking mode.
+
+ It is needed to overcome possible change of uuid on superblock 

[PATCH v3 0/2] ovl introduce "uuid=off"

2020-09-24 Thread Pavel Tikhomirov
This is a v3 of:
ovl: introduce new "index=nouuid" option for inodes index feature

Changes in v3: rebase to overlayfs-next, replace uuid with null in file
handles, propagate ovl_fs to needed functions in a separate patch, add
separate bool "uuid=on/off" option, fix numfs check fallback, add a note
to docs.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 

Pavel Tikhomirov (2):
  ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh
  ovl: introduce new "uuid=off" option for inodes index feature

 Documentation/filesystems/overlayfs.rst |  6 ++
 fs/overlayfs/Kconfig| 17 +
 fs/overlayfs/copy_up.c  | 25 ++---
 fs/overlayfs/export.c   | 10 ++
 fs/overlayfs/namei.c| 24 ++--
 fs/overlayfs/overlayfs.h| 14 --
 fs/overlayfs/ovl_entry.h|  1 +
 fs/overlayfs/super.c| 25 +
 fs/overlayfs/util.c |  3 ++-
 9 files changed, 93 insertions(+), 32 deletions(-)

-- 
2.26.2



[PATCH v3 1/2] ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh

2020-09-24 Thread Pavel Tikhomirov
This will be used in next patch to be able to change uuid checks and
add uuid nullification based on ofs->config.index for a new "uuid=off"
mode.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 
---
 fs/overlayfs/copy_up.c   | 22 --
 fs/overlayfs/export.c| 10 ++
 fs/overlayfs/namei.c | 19 ++-
 fs/overlayfs/overlayfs.h | 14 --
 fs/overlayfs/util.c  |  3 ++-
 5 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 955ecd4030f0..3380039036d6 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -275,7 +275,8 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat 
*stat)
return err;
 }
 
-struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper)
+struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, struct dentry *real,
+ bool is_upper)
 {
struct ovl_fh *fh;
int fh_type, dwords;
@@ -328,8 +329,8 @@ struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool 
is_upper)
return ERR_PTR(err);
 }
 
-int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
-  struct dentry *upper)
+int ovl_set_origin(struct ovl_fs *ofs, struct dentry *dentry,
+  struct dentry *lower, struct dentry *upper)
 {
const struct ovl_fh *fh = NULL;
int err;
@@ -340,7 +341,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry 
*lower,
 * up and a pure upper inode.
 */
if (ovl_can_decode_fh(lower->d_sb)) {
-   fh = ovl_encode_real_fh(lower, false);
+   fh = ovl_encode_real_fh(ofs, lower, false);
if (IS_ERR(fh))
return PTR_ERR(fh);
}
@@ -362,7 +363,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct 
dentry *upper,
const struct ovl_fh *fh;
int err;
 
-   fh = ovl_encode_real_fh(upper, true);
+   fh = ovl_encode_real_fh(ofs, upper, true);
if (IS_ERR(fh))
return PTR_ERR(fh);
 
@@ -380,6 +381,7 @@ static int ovl_set_upper_fh(struct ovl_fs *ofs, struct 
dentry *upper,
 static int ovl_create_index(struct dentry *dentry, struct dentry *origin,
struct dentry *upper)
 {
+   struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
struct dentry *indexdir = ovl_indexdir(dentry->d_sb);
struct inode *dir = d_inode(indexdir);
struct dentry *index = NULL;
@@ -402,7 +404,7 @@ static int ovl_create_index(struct dentry *dentry, struct 
dentry *origin,
if (WARN_ON(ovl_test_flag(OVL_INDEX, d_inode(dentry
return -EIO;
 
-   err = ovl_get_index_name(origin, &name);
+   err = ovl_get_index_name(ofs, origin, &name);
if (err)
return err;
 
@@ -411,7 +413,7 @@ static int ovl_create_index(struct dentry *dentry, struct 
dentry *origin,
if (IS_ERR(temp))
goto free_name;
 
-   err = ovl_set_upper_fh(OVL_FS(dentry->d_sb), upper, temp);
+   err = ovl_set_upper_fh(ofs, upper, temp);
if (err)
goto out;
 
@@ -521,7 +523,7 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, 
struct dentry *temp)
 * hard link.
 */
if (c->origin) {
-   err = ovl_set_origin(c->dentry, c->lowerpath.dentry, temp);
+   err = ovl_set_origin(ofs, c->dentry, c->lowerpath.dentry, temp);
if (err)
return err;
}
@@ -700,7 +702,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
 static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 {
int err;
-   struct ovl_fs *ofs = c->dentry->d_sb->s_fs_info;
+   struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
bool to_index = false;
 
/*
@@ -722,7 +724,7 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 
if (to_index) {
c->destdir = ovl_indexdir(c->dentry->d_sb);
-   err = ovl_get_index_name(c->lowerpath.dentry, &c->destname);
+   err = ovl_get_index_name(ofs, c->lowerpath.dentry, 
&c->destname);
if (err)
return err;
} else if (WARN_ON(!c->parent)) {
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index ed35be3fafc6..41ebf52f1bbc 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -211,7 +211,8 @@ static int ovl_check_encode_origin(struct dentry *dentry)
return 1;
 }
 
-static int ovl_dentry_to_fid(struct dentry *dentry, u32 *fid, int buflen)
+static int ovl_dentry_to_fid(struct ovl_fs *ofs, struct dentry *dentry,
+u32 *fid, int buflen)
 {
struct ovl_fh *fh = NULL;

Re: [PATCH v2] ovl: introduce new "index=nouuid" option for inodes index feature

2020-09-24 Thread Pavel Tikhomirov




On 9/23/20 7:09 PM, Amir Goldstein wrote:

On Wed, Sep 23, 2020 at 6:23 PM Pavel Tikhomirov
 wrote:


This relaxes uuid checks for overlay index feature. It is only possible
in case there is only one filesystem for all the work/upper/lower
directories and bare file handles from this backing filesystem are uniq.
In case we have multiple filesystems here just fall back to normal
"index=on".

This is needed when overlayfs is/was mounted in a container with
index enabled (e.g.: to be able to resolve inotify watch file handles on
it to paths in CRIU), and this container is copied and started alongside
with the original one. This way the "copy" container can't have the same
uuid on the superblock and mounting the overlayfs from it later would
fail.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
   #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
   #mount: /loop-test/loop-mp/merged:
   #mount(2) system call failed: Stale file handle.

mount -t overlay overlay -oindex=nouuid,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
crate the copy of container and we require fs uuid to be uniq for a new
container.

v2: in v1 I missed actual uuid check skip - add it

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 
---


Look reasonable, but you need to rebase over
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs-next

Which makes a lot of your work unneeded.
ofs is propagated to most of the relevant helpers
and you should propagate it down to ovl_decode_real_fh().


Thanks! Will do.



Some minor comments below...


  fs/overlayfs/Kconfig | 16 +++
  fs/overlayfs/export.c|  6 ++--
  fs/overlayfs/namei.c | 35 +++
  fs/overlayfs/overlayfs.h | 23 +++
  fs/overlayfs/ovl_entry.h |  2 +-
  fs/overlayfs/super.c | 61 +---
  6 files changed, 106 insertions(+), 37 deletions(-)

diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
index dd188c7996b3..b00fd44006f9 100644
--- a/fs/overlayfs/Kconfig
+++ b/fs/overlayfs/Kconfig
@@ -61,6 +61,22 @@ config OVERLAY_FS_INDEX

   If unsure, say N.

+config OVERLAY_FS_INDEX_NOUUID
+   bool "Overlayfs: relax uuid checks of inodes index feature"
+   depends on OVERLAY_FS
+   depends on OVERLAY_FS_INDEX
+   help
+ If this config option is enabled then overlay will skip uuid checks
+ for index lower to upper inode map, this only can be done if all
+ upper and lower directories are on the same filesystem where basic
+ fhandles are uniq.
+
+ It is needed to overcome possible change of uuid on superblock of the
+ backing filesystem, e.g. when you copied the virtual disk and mount
+ both the copy of the disk and the original one at the same time.
+
+ If unsure, say N.
+


Please do not add a config option for this.
Isn't a mount option sufficient for your needs?


Users inside Virtuozzo container can mount overlayfs inside the CT (we 
assume that they do "regular" mounts without any "index=" option as 
docker does) so we wan't to setup the default in kernel config, so that 
all "regular" mounts of the user become "index=nouuid" automatically, 
and thus we would be able to both migrate (CRIU inotify resolution by 
fhandle on dump) and copy (copy disk with uuid change) the container 
without problem.





  config OVERLAY_FS_NFS_EXPORT
 bool "Overlayfs: turn on NFS export feature by default"
 depends on OVERLAY_FS
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index 0e696f72cf65..d53feb7547d9 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -676,11 +676,12 @@ static struct dentry *ovl_upper_fh_to_d(struct 
super_block *sb,
 struct ovl_fs *ofs = sb->s_fs_info;
 struct dentry *dentry;
 struct dentry *upper;
+   bool nouuid = ofs->config.index == OVL_INDEX_NOUUID;

 if (!ovl_upper_mnt(ofs))
 return ERR_PTR(-EACCES);

-   upper = ovl_decode_real_fh(f

Re: [PATCH v2] ovl: introduce new "index=nouuid" option for inodes index feature

2020-09-24 Thread Pavel Tikhomirov




On 9/23/20 7:36 PM, Amir Goldstein wrote:

@@ -414,7 +415,7 @@ static int ovl_check_origin(struct ovl_fs *ofs, struct 
dentry *upperdentry,
   * Return 0 on match, -ESTALE on mismatch, < 0 on error.
   */
  static int ovl_verify_fh(struct dentry *dentry, const char *name,
-const struct ovl_fh *fh)
+const struct ovl_fh *fh, bool nouuid)
  {
 struct ovl_fh *ofh = ovl_get_fh(dentry, name);
 int err = 0;
@@ -425,8 +426,14 @@ static int ovl_verify_fh(struct dentry *dentry, const char 
*name,
 if (IS_ERR(ofh))
 return PTR_ERR(ofh);

-   if (fh->fb.len != ofh->fb.len || memcmp(&fh->fb, &ofh->fb, fh->fb.len))
+   if (fh->fb.len != ofh->fb.len) {
 err = -ESTALE;
+   } else {
+   if (nouuid && !uuid_equal(&fh->fb.uuid, &ofh->fb.uuid))
+   ofh->fb.uuid = fh->fb.uuid;
+   if (memcmp(&fh->fb, &ofh->fb, fh->fb.len))
+   err = -ESTALE;
+   }



On second thought I am wondering if we should do that differently.
If users want to work with index=nouuid, they need to work with it from day 1.
index=nouuid should export null uuid in NFS handles and write null uuid
in trusted.overlay.origin xattr.

So in ovl_encode_real_fh() you set null uuid and
instead of relaxing uuid_equal() in ovl_decode_real_fh()
you change it to uuid_is_null().

Do you have a problem with that for Virtuozzo use case?


Actually we've enabled index=on by default in kernel config in Virtuozzo 
only in the new update which is not yet released. So probably we can 
switch to index=nouuid with null uuid in fh.




Thanks,
Amir.



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH v2] ovl: introduce new "index=nouuid" option for inodes index feature

2020-09-23 Thread Pavel Tikhomirov
This relaxes uuid checks for overlay index feature. It is only possible
in case there is only one filesystem for all the work/upper/lower
directories and bare file handles from this backing filesystem are uniq.
In case we have multiple filesystems here just fall back to normal
"index=on".

This is needed when overlayfs is/was mounted in a container with
index enabled (e.g.: to be able to resolve inotify watch file handles on
it to paths in CRIU), and this container is copied and started alongside
with the original one. This way the "copy" container can't have the same
uuid on the superblock and mounting the overlayfs from it later would
fail.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
  #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
  #mount: /loop-test/loop-mp/merged:
  #mount(2) system call failed: Stale file handle.

mount -t overlay overlay -oindex=nouuid,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
crate the copy of container and we require fs uuid to be uniq for a new
container.

v2: in v1 I missed actual uuid check skip - add it

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 
---
 fs/overlayfs/Kconfig | 16 +++
 fs/overlayfs/export.c|  6 ++--
 fs/overlayfs/namei.c | 35 +++
 fs/overlayfs/overlayfs.h | 23 +++
 fs/overlayfs/ovl_entry.h |  2 +-
 fs/overlayfs/super.c | 61 +---
 6 files changed, 106 insertions(+), 37 deletions(-)

diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
index dd188c7996b3..b00fd44006f9 100644
--- a/fs/overlayfs/Kconfig
+++ b/fs/overlayfs/Kconfig
@@ -61,6 +61,22 @@ config OVERLAY_FS_INDEX
 
  If unsure, say N.
 
+config OVERLAY_FS_INDEX_NOUUID
+   bool "Overlayfs: relax uuid checks of inodes index feature"
+   depends on OVERLAY_FS
+   depends on OVERLAY_FS_INDEX
+   help
+ If this config option is enabled then overlay will skip uuid checks
+ for index lower to upper inode map, this only can be done if all
+ upper and lower directories are on the same filesystem where basic
+ fhandles are uniq.
+
+ It is needed to overcome possible change of uuid on superblock of the
+ backing filesystem, e.g. when you copied the virtual disk and mount
+ both the copy of the disk and the original one at the same time.
+
+ If unsure, say N.
+
 config OVERLAY_FS_NFS_EXPORT
bool "Overlayfs: turn on NFS export feature by default"
depends on OVERLAY_FS
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index 0e696f72cf65..d53feb7547d9 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -676,11 +676,12 @@ static struct dentry *ovl_upper_fh_to_d(struct 
super_block *sb,
struct ovl_fs *ofs = sb->s_fs_info;
struct dentry *dentry;
struct dentry *upper;
+   bool nouuid = ofs->config.index == OVL_INDEX_NOUUID;
 
if (!ovl_upper_mnt(ofs))
return ERR_PTR(-EACCES);
 
-   upper = ovl_decode_real_fh(fh, ovl_upper_mnt(ofs), true);
+   upper = ovl_decode_real_fh(fh, ovl_upper_mnt(ofs), true, nouuid);
if (IS_ERR_OR_NULL(upper))
return upper;
 
@@ -700,6 +701,7 @@ static struct dentry *ovl_lower_fh_to_d(struct super_block 
*sb,
struct dentry *index = NULL;
struct inode *inode;
int err;
+   bool nouuid = ofs->config.index == OVL_INDEX_NOUUID;
 
/* First lookup overlay inode in inode cache by origin fh */
err = ovl_check_origin_fh(ofs, fh, false, NULL, &stack);
@@ -752,7 +754,7 @@ static struct dentry *ovl_lower_fh_to_d(struct super_block 
*sb,
goto out_err;
}
if (index) {
-   err = ovl_verify_origin(index, origin.dentry, false);
+   err = ovl_verify_origin(index, origin.dentry, false, nouuid);
if (err)
goto out_err;
}
diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index f7d4358db637..676170d719f8 100644
--- a/fs/overlayfs/namei.c
++

[PATCH] ovl: introduce new "index=nouuid" option for inodes index feature

2020-09-23 Thread Pavel Tikhomirov
This relaxes uuid checks for overlay index feature. It is only possible
in case there is only one filesystem for all the work/upper/lower
directories and bare file handles from this backing filesystem are uniq.
In case we have multiple filesystems here just fall back to normal
"index=on".

This is needed when overlayfs is/was mounted in a container with
index enabled (e.g.: to be able to resolve inotify watch file handles on
it to paths in CRIU), and this container is copied and started alongside
with the original one. This way the "copy" container can't have the same
uuid on the superblock and mounting the overlayfs from it later would
fail.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
  #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 /root/loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
  #mount: /loop-test/loop-mp/merged:
  #mount(2) system call failed: Stale file handle.

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
crate the copy of container and we require fs uuid to be uniq for a new
container.

CC: Amir Goldstein 
CC: Vivek Goyal 
CC: Miklos Szeredi 
CC: linux-unio...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 
---
 fs/overlayfs/Kconfig | 16 
 fs/overlayfs/ovl_entry.h |  2 +-
 fs/overlayfs/super.c | 56 ++--
 3 files changed, 59 insertions(+), 15 deletions(-)

diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
index dd188c7996b3..b00fd44006f9 100644
--- a/fs/overlayfs/Kconfig
+++ b/fs/overlayfs/Kconfig
@@ -61,6 +61,22 @@ config OVERLAY_FS_INDEX
 
  If unsure, say N.
 
+config OVERLAY_FS_INDEX_NOUUID
+   bool "Overlayfs: relax uuid checks of inodes index feature"
+   depends on OVERLAY_FS
+   depends on OVERLAY_FS_INDEX
+   help
+ If this config option is enabled then overlay will skip uuid checks
+ for index lower to upper inode map, this only can be done if all
+ upper and lower directories are on the same filesystem where basic
+ fhandles are uniq.
+
+ It is needed to overcome possible change of uuid on superblock of the
+ backing filesystem, e.g. when you copied the virtual disk and mount
+ both the copy of the disk and the original one at the same time.
+
+ If unsure, say N.
+
 config OVERLAY_FS_NFS_EXPORT
bool "Overlayfs: turn on NFS export feature by default"
depends on OVERLAY_FS
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index b429c80879ee..2fd2cc515ad2 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -13,7 +13,7 @@ struct ovl_config {
bool redirect_dir;
bool redirect_follow;
const char *redirect_mode;
-   bool index;
+   int index;
bool nfs_export;
int xino;
bool metacopy;
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 4b38141c2985..617a5083e659 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -38,10 +38,16 @@ module_param_named(redirect_always_follow, 
ovl_redirect_always_follow,
 MODULE_PARM_DESC(redirect_always_follow,
 "Follow redirects even if redirect_dir feature is turned off");
 
-static bool ovl_index_def = IS_ENABLED(CONFIG_OVERLAY_FS_INDEX);
-module_param_named(index, ovl_index_def, bool, 0644);
+#define OVL_INDEX_OFF  0
+#define OVL_INDEX_ON   1
+#define OVL_INDEX_NOUUID   2
+
+static int ovl_index_def = IS_ENABLED(CONFIG_OVERLAY_FS_INDEX_NOUUID) ?
+  OVL_INDEX_NOUUID :
+  IS_ENABLED(CONFIG_OVERLAY_FS_INDEX);
+module_param_named(index, ovl_index_def, int, 0644);
 MODULE_PARM_DESC(index,
-"Default to on or off for the inodes index feature");
+"Default to on, off or nouuid for the inodes index feature");
 
 static bool ovl_nfs_export_def = IS_ENABLED(CONFIG_OVERLAY_FS_NFS_EXPORT);
 module_param_named(nfs_export, ovl_nfs_export_def, bool, 0644);
@@ -352,8 +358,18 @@ static int ovl_show_options(struct seq_file *m, struct 
dentry *dentry)
seq_puts(m, ",default_permissions");
if (strcmp(ofs->config.redirect_mode, ovl_redirect_mode_def()) != 0)
seq_printf(m, ",redirect_dir=%s", ofs-&g

Re: [PATCH] ovl: introduce new "index=nouuid" option for inodes index feature

2020-09-23 Thread Pavel Tikhomirov

Please drop, I accidentally missed several hunks...

On 9/23/20 3:50 PM, Pavel Tikhomirov wrote:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
   #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 /root/loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
   #mount: /loop-test/loop-mp/merged:
   #mount(2) system call failed: Stale file handle.

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
crate the copy of container and we require fs uuid to be uniq for a new
container.


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-04 Thread Pavel Tikhomirov




On 8/4/20 8:43 AM, Andrei Vagin wrote:

On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote:

On 30.07.2020 17:34, Eric W. Biederman wrote:

Kirill Tkhai  writes:


Currently, there is no a way to list or iterate all or subset of namespaces
in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
but some also may be as open files, which are not attached to a process.
When a namespace open fd is sent over unix socket and then closed, it is
impossible to know whether the namespace exists or not.

Also, even if namespace is exposed as attached to a process or as open file,
iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
this multiplies at tasks and fds number.


Could you describe with more details when you need to iterate
namespaces?

There are three ways to hold namespaces.

* processes
* bind-mounts
* file descriptors

When CRIU dumps a container, it enumirates all processes, collects file
descriptors and mounts. This means that we will be able to collect all
namespaces, doesn't it?


Yes we can. But it would be much easier for us to have all namespaces in 
one place isn't it?


And this patch-set has another non-CRIU use case. It can simplify a view 
to namespaces for a normal user. Lets consider some cases:


Lets assume we have an empty (no processes) mount namespace M which is 
held by single open fd, which was put in a unix socket and closed, unix 
socket has single open fd to it which was in it's turn put to another 
unix socket and again and again until we reach unix socket max depth... 
How should normal user find this mount namespace M?


Lets assume that M also has a nsfs bindmount which helds some empty 
network namespace N... How should normal user find N?


Lets also assume that M has overmounted "/":

mount -t tmpfs tmpfs /

Now if you would enter M you would see single tmpfs (because of implicit 
chroot to overmount on setns) in mountinfo and there is no way to see 
full mountinfo if you does not know real root dentry... How should 
normal user (or even CRIU) find N?


So my personal opinion is that we need this interface, maybe it should 
be done somehow different but we need it.






--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-31 Thread Pavel Tikhomirov




On 7/31/20 1:13 AM, Eric W. Biederman wrote:

Kirill Tkhai  writes:


On 30.07.2020 17:34, Eric W. Biederman wrote:

Kirill Tkhai  writes:


Currently, there is no a way to list or iterate all or subset of namespaces
in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
but some also may be as open files, which are not attached to a process.
When a namespace open fd is sent over unix socket and then closed, it is
impossible to know whether the namespace exists or not.

Also, even if namespace is exposed as attached to a process or as open file,
iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
this multiplies at tasks and fds number.


I am very dubious about this.

I have been avoiding exactly this kind of interface because it can
create rather fundamental problems with checkpoint restart.


restart/restore :)


You do have some filtering and the filtering is not based on current.
Which is good.

A view that is relative to a user namespace might be ok.It almost
certainly does better as it's own little filesystem than as an extension
to proc though.

The big thing we want to ensure is that if you migrate you can restore
everything.  I don't see how you will be able to restore these files
after migration.  Anything like this without having a complete
checkpoint/restore story is a non-starter.


There is no difference between files in /proc/namespaces/ directory and 
/proc/[pid]/ns/.

CRIU can restore open files in /proc/[pid]/ns, the same will be with 
/proc/namespaces/ files.
As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't 
see any
problem here.


An obvious diffference is that you are adding the inode to the inode to
the file name.  Which means that now you really do have to preserve the
inode numbers during process migration.

Which means now we have to do all of the work to make inode number
restoration possible.  Which means now we need to have multiple
instances of nsfs so that we can restore inode numbers.

I think this is still possible but we have been delaying figuring out
how to restore inode numbers long enough that may be actual technical
problems making it happen.

Now maybe CRIU can handle the names of the files changing during
migration but you have just increased the level of difficulty for doing
that.


Yes adding /proc/namespaces/:[] files may be a problem 
to CRIU.


First I would like to highlight that open files are not a problem. 
Because open file from /proc/namespaces/* are exactly the same as open 
files from /proc//ns/. So when we c/r an nsfs open file fd 
on dump we readlink the fd and get :[] and on restore 
we recreate each dumped namespace and open an fd to each, so we can 
'dup' it when restoring open file. It will be an fd to topologically 
same namespace though ns_ino would be newly generated.


But the problem I see is with readdir. What if some task is reading 
/proc/namespaces/ directory at the time of dump, after restore directory 
will contain new names for namespaces and possibly in different order, 
this way if process continues to readdir it can miss some namespaces or 
read some twice.


May be instead of multiple files in /proc/namespaces directory, we can 
leave just one file /proc/namespaces and when we open it we would return 
e.g. a unix socket filled with all the fds of all namespacess visible at 
this point. It looks like a possible solution to the above problem.


CRIU can restore unix sockets with fds inside, so we should be able to 
dump process using this functionality.





If you have a specific worries about, let's discuss them.


I was asking and I am asking that it be described in the patch
description how a container using this feature can be migrated
from one machine to another.  This code is so close to being problematic
that we need be very careful we don't fundamentally break CRIU while
trying to make it's job simpler and easier.


CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces C/R.
  

Further by not going through the processes it looks like you are
bypassing the existing permission checks.  Which has the potential
to allow someone to use a namespace who would not be able to otherwise.


I agree, and I wrote to Christian, that permissions should be more strict.
This just should be formalized. Let's discuss this.


So I think this goes one step too far but I am willing to be persuaded
otherwise.



Eric



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH 0/2] overlayfs: C/R enhancements

2020-06-05 Thread Pavel Tikhomirov




On 6/5/20 5:35 AM, Amir Goldstein wrote:

On Fri, Jun 5, 2020 at 12:34 AM Alexander Mikhalitsyn
 wrote:


Hello,


But overlayfs won't accept these "output only" options as input args,

which is a problem.

Will it be problematic if we simply ignore "lowerdir_mnt_id" and 
"upperdir_mnt_id" options in ovl_parse_opt()?



That would solve this small problem.


This is not a big problem actually as these options shown in mountinfo 
for overlay had been "output only" forever, please see these two 
examples below:


a) Imagine you've mounted overlay with relative paths and forgot (or you 
never known as you are another user) where your cwd was at the moment of 
mount syscall. - How would you use those options as "input" to create 
the same overlay mount somethere else (bind-mounting not involved)?


b) Imagine you've mounted overlay with absolute paths and someone (other 
user) overmounted lower (upper/workdir) paths for you, all directory 
structure would be the same on overmount but yet files are different. - 
How would you use those options from mountinfo as "input" ones?


We try to make them much closer to "input" ones.

Agreed, we should ignore *_mnt_id on mount because paths identify mounts 
at the time of mount call.





Wouldn't it be better for C/R to implement mount options

that overlayfs can parse and pass it mntid and fhandle instead
of paths? >>
Problem is that we need to know on C/R "dump stage" which mounts are used on lower layers 
and upper layer. Most likely I don't understand something but I can't catch how 
"mount-time" options will help us.


As you already know from inotify/fanotify C/R fhandle is timeless, so
there would be no distinction between mount time and dump time.


Pair of fhandle+mnt_id looks an equivalent to path+mnt_id pair, CRIU 
will just need to open fhandle+mnt_id with open_by_handle_at and 
readlink to get path on dump and continue to use path+mnt_id as before. 
(not too common with fhandles but it's my current understanding)


But if you take a look on (a) and (b) again, the regular user does not 
see full information about overlay mount in /proc/pid/mountinfo, they 
can't just take a look on it and understand from there it comes from. 
Resolving fhandle looks like a too hard task for a user.



About mnt_id, your patches will cause the original mount-time mounts to be busy.
That is a problem as well.


Children mounts lock parent, open files lock parent. Another analogy is 
a loop device which locks the backing file mount (AFAICS). Anyway one 
can lazy umount, can't they? But I'm not too sure for this one, maybe 
you can share more implications of this problem?




I think you should describe the use case is more details.
Is your goal to C/R any overlayfs mount that the process has open
files on? visible to process
We wan't to dump a container, not a simple process, if the container 
process has access to some resource CRIU needs to restore this resource.


Imagine the process in container mounts it's own overlay inside 
container, for instance to imulate write access to readonly mount or 
just to implement some snapshots, don't know exact use case. And we want 
to checkpoint/restore this container. (Currently CRIU only supports 
overlay as external mount, e.g. for docker continers docker engine 
pre-creates overlay for us and we just bind from it - it's a different 
case.) If the in-container process creates the in-container mount we 
need to recreate it on restore so that the in-container view of the 
filesystem persists.



For NFS export, we use the persistent descriptor {uuid;fhandle}
(a.k.a. struct ovl_fh) to encode
an underlying layer object.

CRIU can look for an existing mount to a filesystem with uuid as restore stage
(or even mount this filesystem) and use open_by_handle_at() to open a
path to layer.


On restore we can be on another physical node, so I doubt we have same 
uuid's, sorry I don't fully understand here already.



After mounting overlay, that mount to underlying fs can even be discarded.

And if this works for you, you don't have to export the layers ovl_fh in
/proc/mounts, you can export them in numerous other ways.
One way from the top of my head, getxattr on overlay root dir.
"trusted.overlay" xattr is anyway a reserved prefix, so "trusted.overlay.layers"
for example could work.


Thanks xattr might be a good option, but still don't forget about (a) 
and (b), users like to know all information about mount from 
/proc/pid/mountinfo.




Thanks,
Amir.



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] sunrpc: fix crash when cache_head become valid before update

2019-10-08 Thread Pavel Tikhomirov
Add Neil to CC, sorry, had lost it somehow...

On 10/1/19 11:03 AM, Pavel Tikhomirov wrote:
> I was investigating a crash in our Virtuozzo7 kernel which happened in
> in svcauth_unix_set_client. I found out that we access m_client field
> in ip_map structure, which was received from sunrpc_cache_lookup (we
> have a bit older kernel, now the code is in sunrpc_cache_add_entry), and
> these field looks uninitialized (m_client == 0x74 don't look like a
> pointer) but in the cache_head in flags we see 0x1 which is CACHE_VALID.
> 
> It looks like the problem appeared from our previous fix to sunrpc (1):
> commit 4ecd55ea0742 ("sunrpc: fix cache_head leak due to queued
> request")
> 
> And we've also found a patch already fixing our patch (2):
> commit d58431eacb22 ("sunrpc: don't mark uninitialised items as VALID.")
> 
> Though the crash is eliminated, I think the core of the problem is not
> completely fixed:
> 
> Neil in the patch (2) makes cache_head CACHE_NEGATIVE, before
> cache_fresh_locked which was added in (1) to fix crash. These way
> cache_is_valid won't say the cache is valid anymore and in
> svcauth_unix_set_client the function cache_check will return error
> instead of 0, and we don't count entry as initialized.
> 
> But it looks like we need to remove cache_fresh_locked completely in
> sunrpc_cache_lookup:
> 
> In (1) we've only wanted to make cache_fresh_unlocked->cache_dequeue so
> that cache_requests with no readers also release corresponding
> cache_head, to fix their leak.  We with Vasily were not sure if
> cache_fresh_locked and cache_fresh_unlocked should be used in pair or
> not, so we've guessed to use them in pair.
> 
> Now we see that we don't want the CACHE_VALID bit set here by
> cache_fresh_locked, as "valid" means "initialized" and there is no
> initialization in sunrpc_cache_add_entry. Both expiry_time and
> last_refresh are not used in cache_fresh_unlocked code-path and also not
> required for the initial fix.
> 
> So to conclude cache_fresh_locked was called by mistake, and we can just
> safely remove it instead of crutching it with CACHE_NEGATIVE. It looks
> ideologically better for me. Hope I don't miss something here.
> 
> Here is our crash backtrace:
> [13108726.326291] BUG: unable to handle kernel NULL pointer dereference at 
> 0074
> [13108726.326365] IP: [] 
> svcauth_unix_set_client+0x2ab/0x520 [sunrpc]
> [13108726.326448] PGD 0
> [13108726.326468] Oops: 0002 [#1] SMP
> [13108726.326497] Modules linked in: nbd isofs xfs loop 
> kpatch_cumulative_81_0_r1(O) xt_physdev nfnetlink_queue bluetooth rfkill 
> ip6table_nat nf_nat_ipv6 ip_vs_wrr ip_vs_wlc ip_vs_sh nf_conntrack_netlink 
> ip_vs_sed ip_vs_pe_sip nf_conntrack_sip ip_vs_nq ip_vs_lc ip_vs_lblcr 
> ip_vs_lblc ip_vs_ftp ip_vs_dh nf_nat_ftp nf_conntrack_ftp iptable_raw 
> xt_recent nf_log_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common xt_LOG xt_limit 
> xt_TCPMSS xt_tcpmss vxlan ip6_udp_tunnel udp_tunnel xt_statistic xt_NFLOG 
> nfnetlink_log dummy xt_mark xt_REDIRECT nf_nat_redirect raw_diag udp_diag 
> tcp_diag inet_diag netlink_diag af_packet_diag unix_diag rpcsec_gss_krb5 
> xt_addrtype ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
> nf_reject_ipv6 ebtable_nat ebtable_broute nf_conntrack_ipv6 nf_defrag_ipv6 
> ip6table_mangle ip6table_raw nfsv4
> [13108726.327173]  dns_resolver cls_u32 binfmt_misc arptable_filter 
> arp_tables ip6table_filter ip6_tables devlink fuse_kio_pcs ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 xt_comment 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_wdog_tmo xt_multiport bonding xt_set 
> xt_conntrack iptable_filter iptable_mangle kpatch(O) ebtable_filter ebt_among 
> ebtables ip_set_hash_ip ip_set nfnetlink vfat fat skx_edac intel_powerclamp 
> coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass fuse pcspkr ses 
> enclosure joydev sg mei_me hpwdt hpilo lpc_ich mei ipmi_si shpchp 
> ipmi_devintf ipmi_msghandler xt_ipvs acpi_power_meter ip_vs_rr nfsv3 nfsd 
> auth_rpcgss nfs_acl nfs lockd grace fscache nf_nat cls_fw sch_htb sch_cbq 
> sch_sfq ip_vs em_u32 nf_conntrack tun br_netfilter veth overlay ip6_vzprivnet 
> ip6_vznetstat ip_vznetstat
> [13108726.327817]  ip_vzprivnet vziolimit vzevent vzlist vzstat vznetstat 
> vznetdev vzmon vzdev bridge pio_kaio pio_nfs pio_direct pfmt_raw pfmt_ploop1 
> ploop ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 
> i2c_algo_bit drm_kms_helper scsi_transport_iscsi 8021q syscopyarea 
> sysfillrect garp sysimgblt fb_sys_fops mrp stp ttm llc bnx2x crct10dif_pclmul 
> crct10dif_common crc32_pclmul crc32c_intel drm dm_multipath 
> ghash_clmulni_intel uas aesni_inte

[PATCH] sunrpc: fix crash when cache_head become valid before update

2019-10-01 Thread Pavel Tikhomirov
a0a6a41b1160 ti: a0c2a74bc000 task.ti: 
a0c2a74bc000
[13108726.328610] RIP: 0010:[]  [] 
svcauth_unix_set_client+0x2ab/0x520 [sunrpc]
[13108726.328706] RSP: 0018:a0c2a74bfd80  EFLAGS: 00010246
[13108726.328750] RAX: 0001 RBX: a0a6183ae000 RCX: 

[13108726.328811] RDX: 0074 RSI: 0286 RDI: 
a0c2a74bfcf0
[13108726.328864] RBP: a0c2a74bfe00 R08: a0bab8c22960 R09: 
0001
[13108726.328916] R10: 0001 R11: 0001 R12: 
a0a32aa7f000
[13108726.328969] R13: a0a6183afac0 R14: a0c233d88d00 R15: 
a0c2a74bfdb4
[13108726.329022] FS:  () GS:a0e17f9c() 
knlGS:
[13108726.329081] CS:  0010 DS:  ES:  CR0: 80050033
[13108726.332311] CR2: 0074 CR3: 0026a1b28000 CR4: 
007607e0
[13108726.334606] DR0:  DR1:  DR2: 

[13108726.336754] DR3:  DR6: fffe0ff0 DR7: 
0400
[13108726.338908] PKRU: 
[13108726.341047] Call Trace:
[13108726.343074]  [] ? groups_alloc+0x34/0x110
[13108726.344837]  [] svc_set_client+0x24/0x30 [sunrpc]
[13108726.346631]  [] svc_process_common+0x241/0x710 [sunrpc]
[13108726.348332]  [] svc_process+0x103/0x190 [sunrpc]
[13108726.350016]  [] nfsd+0xdf/0x150 [nfsd]
[13108726.351735]  [] ? nfsd_destroy+0x80/0x80 [nfsd]
[13108726.353459]  [] kthread+0xd1/0xe0
[13108726.355195]  [] ? create_kthread+0x60/0x60
[13108726.356896]  [] ret_from_fork_nospec_begin+0x7/0x21
[13108726.358577]  [] ? create_kthread+0x60/0x60
[13108726.360240] Code: 4c 8b 45 98 0f 8e 2e 01 00 00 83 f8 fe 0f 84 76 fe ff 
ff 85 c0 0f 85 2b 01 00 00 49 8b 50 40 b8 01 00 00 00 48 89 93 d0 1a 00 00  
0f c1 02 83 c0 01 83 f8 01 0f 8e 53 02 00 00 49 8b 44 24 38
[13108726.363769] RIP  [] svcauth_unix_set_client+0x2ab/0x520 
[sunrpc]
[13108726.365530]  RSP 
[13108726.367179] CR2: 0074

Fixes: d58431eacb22 ("sunrpc: don't mark uninitialised items as VALID.")
Signed-off-by: Pavel Tikhomirov 

---
 net/sunrpc/cache.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index a349094f6fb7..f740cb51802a 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -53,9 +53,6 @@ static void cache_init(struct cache_head *h, struct 
cache_detail *detail)
h->last_refresh = now;
 }
 
-static inline int cache_is_valid(struct cache_head *h);
-static void cache_fresh_locked(struct cache_head *head, time_t expiry,
-   struct cache_detail *detail);
 static void cache_fresh_unlocked(struct cache_head *head,
struct cache_detail *detail);
 
@@ -105,9 +102,6 @@ static struct cache_head *sunrpc_cache_add_entry(struct 
cache_detail *detail,
if (cache_is_expired(detail, tmp)) {
hlist_del_init_rcu(&tmp->cache_list);
detail->entries --;
-   if (cache_is_valid(tmp) == -EAGAIN)
-   set_bit(CACHE_NEGATIVE, &tmp->flags);
-   cache_fresh_locked(tmp, 0, detail);
freeme = tmp;
break;
}
-- 
2.21.0



mm/swap: possible problem introduced when replacing REQ_NOIDLE with REQ_IDLE

2019-08-09 Thread Pavel Tikhomirov
Hi, all.

Then porting patches from mainstream I've found some strange code:

 > commit a2b809672ee6fcb4d5756ea815725b3dbaea654e
 > Author: Christoph Hellwig 
 > Date:   Tue Nov 1 07:40:09 2016 -0600
 >
 > block: replace REQ_NOIDLE with REQ_IDLE
 >
 > Noidle should be the default for writes as seen by all the compounds
 > definitions in fs.h using it.  In fact only direct I/O really should
 > be using NODILE, so turn the whole flag around to get the defaults
 > right, which will make our life much easier especially onces the
 > WRITE_* defines go away.
 >
 > This assumes all the existing "raw" users of REQ_SYNC for writes
 > want noidle behavior, which seems to be spot on from a quick audit.
 >
 > Signed-off-by: Christoph Hellwig 
 > Signed-off-by: Jens Axboe 
 >
 > diff --git a/include/linux/fs.h b/include/linux/fs.h
 > index ccedccb28ec8..46a74209917f 100644
 > --- a/include/linux/fs.h
 > +++ b/include/linux/fs.h
 > @@ -197,11 +197,11 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, 
loff_t offset,
 >  #define WRITE  REQ_OP_WRITE
 >
 >  #define READ_SYNC  0
 > -#define WRITE_SYNC (REQ_SYNC | REQ_NOIDLE)
 > -#define WRITE_ODIRECT  REQ_SYNC
 > -#define WRITE_FLUSH(REQ_NOIDLE | REQ_PREFLUSH)
 > -#define WRITE_FUA  (REQ_NOIDLE | REQ_FUA)
 > -#define WRITE_FLUSH_FUA(REQ_NOIDLE | REQ_PREFLUSH | 
REQ_FUA)
 > +#define WRITE_SYNC REQ_SYNC
 > +#define WRITE_ODIRECT  (REQ_SYNC | REQ_IDLE)
 > +#define WRITE_FLUSHREQ_PREFLUSH
 > +#define WRITE_FUA  REQ_FUA
 > +#define WRITE_FLUSH_FUA(REQ_PREFLUSH | REQ_FUA)
 >
 >  /*
 >   * Attribute flags.  These should be or-ed together to figure out what

The above commit changes the meaning of the REQ_SYNC flag, before the 
patch it was equal to WRITE_ODIRECT and after the patch it is equal to 
WRITE_SYNC. And thus I think it became treated differently (I see only 
one place left in wbt_should_throttle.).

But in __swap_writepage() both before and after the mentioned patch we 
still pass a single REQ_SYNC without any REQ_IDLE/REQ_UNIDLE:

 > [snorch@snorch linux]$ git blame 
a2b809672ee6fcb4d5756ea815725b3dbaea654e^ mm/page_io.c | grep -a5 REQ_SYNC
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 319) 
 unlock_page(page);
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 320) 
 ret = -ENOMEM;
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 321) 
 goto out;
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 322)   }
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 323) 
if (wbc->sync_mode == WB_SYNC_ALL)
 > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 324) 
 bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC);
 > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 325) 
else
 > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 326) 
 bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 > f8891e5e1f93a (Christoph Lameter 2006-06-30 01:55:45 -0700 327) 
count_vm_event(PSWPOUT);
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 328) 
set_page_writeback(page);
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 329) 
unlock_page(page);
 > [snorch@snorch linux]$ git blame 
a2b809672ee6fcb4d5756ea815725b3dbaea654e mm/page_io.c | grep -a5 REQ_SYNC
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 319) 
 unlock_page(page);
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 320) 
 ret = -ENOMEM;
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 321) 
 goto out;
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 322)   }
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 323) 
if (wbc->sync_mode == WB_SYNC_ALL)
 > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 324) 
 bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC);
 > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 325) 
else
 > ba13e83ec334c (Jens Axboe2016-08-01 09:38:44 -0600 326) 
 bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 > f8891e5e1f93a (Christoph Lameter 2006-06-30 01:55:45 -0700 327) 
count_vm_event(PSWPOUT);
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 328) 
set_page_writeback(page);
 > ^1da177e4c3f4 (Linus Torvalds2005-04-16 15:20:36 -0700 329) 
unlock_page(page);

It looks like we've changed the way how we handle swap page writes from 
"odirect" way to "regular" sync write way, these can be wrong. This may 
also affect deprecated cfq io-scheduler on older kernels.

Thanks in advance for any advice on what to do with these, may be I miss 
something.

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH] ext4: remove unnecessary gotos in ext4_xattr_set_entry

2019-05-31 Thread Pavel Tikhomirov
In the "out" label we only iput old/new_ea_inode-s, in all these places
these variables are always NULL so there is no point in goto to "out".

Signed-off-by: Pavel Tikhomirov 
---
 fs/ext4/xattr.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 491f9ee4040e..ac2ddd4446b3 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1601,8 +1601,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
next = EXT4_XATTR_NEXT(last);
if ((void *)next >= s->end) {
EXT4_ERROR_INODE(inode, "corrupted xattr entries");
-   ret = -EFSCORRUPTED;
-   goto out;
+   return -EFSCORRUPTED;
}
if (!last->e_value_inum && last->e_value_size) {
size_t offs = le16_to_cpu(last->e_value_offs);
@@ -1620,8 +1619,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
free += EXT4_XATTR_LEN(name_len) + old_size;
 
if (free < EXT4_XATTR_LEN(name_len) + new_size) {
-   ret = -ENOSPC;
-   goto out;
+   return -ENOSPC;
}
 
/*
@@ -1634,8 +1632,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
new_size && is_block &&
(min_offs + old_size - new_size) <
EXT4_XATTR_BLOCK_RESERVE(inode)) {
-   ret = -ENOSPC;
-   goto out;
+   return -ENOSPC;
}
}
 
-- 
2.20.1



Re: [PATCH] tracing: Fix event filters and triggers to handle negative numbers

2019-02-28 Thread Pavel Tikhomirov
ping, looks like the patch was lost

On 8/24/18 3:48 AM, Steven Rostedt wrote:
> On Thu, 23 Aug 2018 13:25:34 +0300
> Pavel Tikhomirov  wrote:
> 
>> Then tracing syscall exit event it is extremely useful to filter exit
>> codes equal to some negative value, to react only to required errors.
>> But negative numbers does not work:
>>
>> [root@snorch sys_exit_read]# echo "ret == -1" > filter
>> bash: echo: write error: Invalid argument
>> [root@snorch sys_exit_read]# cat filter
>> ret == -1
>>  ^
>> parse_error: Invalid value (did you forget quotes)?
> 
> Thanks for the patch. I'll apply it and then start testing it!
> 
> -- Steve
> 
>>
>> Similar thing happens when setting triggers.
>>
>> These is a regression in v4.17 introduced by the commit mentioned below,
>> testing without these commit shows no problem with negative numbers.
>>
>> Fixes: 80765597bc58 ("tracing: Rewrite filter logic to be simpler and 
>> faster")
>> Signed-off-by: Pavel Tikhomirov 
>> ---
>>   kernel/trace/trace_events_filter.c | 5 -
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/trace_events_filter.c 
>> b/kernel/trace/trace_events_filter.c
>> index 84a65173b1e9..2ba449292561 100644
>> --- a/kernel/trace/trace_events_filter.c
>> +++ b/kernel/trace/trace_events_filter.c
>> @@ -1299,7 +1299,7 @@ static int parse_pred(const char *str, void *data,
>>  /* go past the last quote */
>>  i++;
>>   
>> -} else if (isdigit(str[i])) {
>> +} else if (isdigit(str[i]) || str[i] == '-') {
>>   
>>  /* Make sure the field is not a string */
>>  if (is_string_field(field)) {
>> @@ -1312,6 +1312,9 @@ static int parse_pred(const char *str, void *data,
>>  goto err_free;
>>  }
>>   
>> +if (str[i] == '-')
>> +i++;
>> +
>>  /* We allow 0xDEADBEEF */
>>  while (isalnum(str[i]))
>>  i++;
> 

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] vhost: return EINVAL if iovecs size does not match the message size

2018-12-13 Thread Pavel Tikhomirov
On 12/13/2018 10:55 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 05:53:50PM +0300, Pavel Tikhomirov wrote:
>> We've failed to copy and process vhost_iotlb_msg so let userspace at
>> least know about it. For instance before these patch the code below runs
>> without any error:
>>
>> int main()
>> {
>>struct vhost_msg msg;
>>struct iovec iov;
>>int fd;
>>
>>fd = open("/dev/vhost-net", O_RDWR);
>>if (fd == -1) {
>>  perror("open");
>>  return 1;
>>}
>>
>>iov.iov_base = &msg;
>>iov.iov_len = sizeof(msg)-4;
>>
>>if (writev(fd, &iov,1) == -1) {
>>  perror("writev");
>>  return 1;
>>}
>>
>>return 0;
>> }
>>
>> Signed-off-by: Pavel Tikhomirov 
> 
> Thanks for the patch!
> 
>> ---
>>   drivers/vhost/vhost.c | 8 ++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 3a5f81a66d34..03014224ef13 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -1024,8 +1024,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>>  int type, ret;
>>   
>>  ret = copy_from_iter(&type, sizeof(type), from);
>> -if (ret != sizeof(type))
>> +if (ret != sizeof(type)) {
>> +ret = -EINVAL;
>>  goto done;
>> +}
>>   
>>  switch (type) {
>>  case VHOST_IOTLB_MSG:
> 
> should this be EFAULT rather?

We already have "Invalid argument" returned when wrong type of vhost_msg 
received, I though it would be fine to return same thing if we have 
wrong size of vhost_msg.

When we return "Bad address" because of vhost_process_iotlb_msg fail, it 
is because our vhost_dev has no ->iotlb so our problem is not connected 
to the data passed from userspace but with the state of vhost_dev.

So I like EINVAL more in these two cases.

> 
>> @@ -1044,8 +1046,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>>   
>>  iov_iter_advance(from, offset);
>>  ret = copy_from_iter(&msg, sizeof(msg), from);
>> -if (ret != sizeof(msg))
>> +if (ret != sizeof(msg)) {
>> +ret = -EINVAL;
>>  goto done;
>> +}
>>  if (vhost_process_iotlb_msg(dev, &msg)) {
>>  ret = -EFAULT;
>>  goto done;
> 
> This too?
> 
>> -- 
>> 2.17.1

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH] vhost: return EINVAL if iovecs size does not match the message size

2018-12-13 Thread Pavel Tikhomirov
We've failed to copy and process vhost_iotlb_msg so let userspace at
least know about it. For instance before these patch the code below runs
without any error:

int main()
{
  struct vhost_msg msg;
  struct iovec iov;
  int fd;

  fd = open("/dev/vhost-net", O_RDWR);
  if (fd == -1) {
perror("open");
return 1;
  }

  iov.iov_base = &msg;
  iov.iov_len = sizeof(msg)-4;

  if (writev(fd, &iov,1) == -1) {
perror("writev");
return 1;
  }

  return 0;
}

Signed-off-by: Pavel Tikhomirov 
---
 drivers/vhost/vhost.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 3a5f81a66d34..03014224ef13 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1024,8 +1024,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
int type, ret;
 
ret = copy_from_iter(&type, sizeof(type), from);
-   if (ret != sizeof(type))
+   if (ret != sizeof(type)) {
+   ret = -EINVAL;
goto done;
+   }
 
switch (type) {
case VHOST_IOTLB_MSG:
@@ -1044,8 +1046,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 
iov_iter_advance(from, offset);
ret = copy_from_iter(&msg, sizeof(msg), from);
-   if (ret != sizeof(msg))
+   if (ret != sizeof(msg)) {
+   ret = -EINVAL;
goto done;
+   }
if (vhost_process_iotlb_msg(dev, &msg)) {
ret = -EFAULT;
goto done;
-- 
2.17.1



[PATCH] nfs: fix comment to nfs_generic_pg_test which does the opposite

2018-11-14 Thread Pavel Tikhomirov
Please see comment to filelayout_pg_test for reference.

To: Trond Myklebust 
Cc: Anna Schumaker 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov 
---
 fs/nfs/pagelist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 5c4568a0804b..87f3da1fd850 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -461,7 +461,7 @@ EXPORT_SYMBOL_GPL(nfs_wait_on_request);
  * @prev: previous request in desc, or NULL
  * @req: this request
  *
- * Returns zero if @req can be coalesced into @desc, otherwise it returns
+ * Returns zero if @req cannot be coalesced into @desc, otherwise it returns
  * the size of the request.
  */
 size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
-- 
2.17.1



[PATCH v2] mm: cleancache: fix corruption on missed inode invalidation

2018-11-12 Thread Pavel Tikhomirov
If all pages are deleted from the mapping by memory reclaim and also
moved to the cleancache:

__delete_from_page_cache
  (no shadow case)
  unaccount_page_cache_page
cleancache_put_page
  page_cache_delete
mapping->nrpages -= nr
(nrpages becomes 0)

We don't clean the cleancache for an inode after final file truncation
(removal).

truncate_inode_pages_final
  check (nrpages || nrexceptional) is false
no truncate_inode_pages
  no cleancache_invalidate_inode(mapping)

These way when reading the new file created with same inode we may get
these trash leftover pages from cleancache and see wrong data instead of
the contents of the new file.

Fix it by always doing truncate_inode_pages which is already ready for
nrpages == 0 && nrexceptional == 0 case and just invalidates inode.

v2: add comment

Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache")
To: Andrew Morton 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Jan Kara 
Cc: Matthew Wilcox 
Cc: Andi Kleen 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Vasily Averin 
Reviewed-by: Andrey Ryabinin 
Reviewed-by: Jan Kara 
Signed-off-by: Pavel Tikhomirov 
---
 mm/truncate.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 45d68e90b703..2c5285767ce5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -517,9 +517,14 @@ void truncate_inode_pages_final(struct address_space 
*mapping)
 */
xa_lock_irq(&mapping->i_pages);
xa_unlock_irq(&mapping->i_pages);
-
-   truncate_inode_pages(mapping, 0);
}
+
+   /*
+* Cleancache needs notification even if there are no pages or shadow
+* entries, else we will leave leftover pages in the cleancache for
+* a deleted inode.
+*/
+   truncate_inode_pages(mapping, 0);
 }
 EXPORT_SYMBOL(truncate_inode_pages_final);
 
-- 
2.17.1



[PATCH] mm: cleancache: fix corruption on missed inode invalidation

2018-11-12 Thread Pavel Tikhomirov
If all pages are deleted from the mapping by memory reclaim and also
moved to the cleancache:

__delete_from_page_cache
  (no shadow case)
  unaccount_page_cache_page
cleancache_put_page
  page_cache_delete
mapping->nrpages -= nr
(nrpages becomes 0)

We don't clean the cleancache for an inode after final file truncation
(removal).

truncate_inode_pages_final
  check (nrpages || nrexceptional) is false
no truncate_inode_pages
  no cleancache_invalidate_inode(mapping)

These way when reading the new file created with same inode we may get
these trash leftover pages from cleancache and see wrong data instead of
the contents of the new file.

Fix it by always doing truncate_inode_pages which is already ready for
nrpages == 0 && nrexceptional == 0 case and just invalidates inode.

Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache")
To: Andrew Morton 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Jan Kara 
Cc: Matthew Wilcox 
Cc: Andi Kleen 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Vasily Averin 
Reviewed-by: Andrey Ryabinin 
Signed-off-by: Pavel Tikhomirov 
---
 mm/truncate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 45d68e90b703..4c56c19e76eb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -517,9 +517,9 @@ void truncate_inode_pages_final(struct address_space 
*mapping)
 */
xa_lock_irq(&mapping->i_pages);
xa_unlock_irq(&mapping->i_pages);
-
-   truncate_inode_pages(mapping, 0);
}
+
+   truncate_inode_pages(mapping, 0);
 }
 EXPORT_SYMBOL(truncate_inode_pages_final);
 
-- 
2.17.1



[PATCH] tracing: Fix event filters and triggers to handle negative numbers

2018-08-23 Thread Pavel Tikhomirov
Then tracing syscall exit event it is extremely useful to filter exit
codes equal to some negative value, to react only to required errors.
But negative numbers does not work:

[root@snorch sys_exit_read]# echo "ret == -1" > filter
bash: echo: write error: Invalid argument
[root@snorch sys_exit_read]# cat filter
ret == -1
^
parse_error: Invalid value (did you forget quotes)?

Similar thing happens when setting triggers.

These is a regression in v4.17 introduced by the commit mentioned below,
testing without these commit shows no problem with negative numbers.

Fixes: 80765597bc58 ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Pavel Tikhomirov 
---
 kernel/trace/trace_events_filter.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_filter.c 
b/kernel/trace/trace_events_filter.c
index 84a65173b1e9..2ba449292561 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1299,7 +1299,7 @@ static int parse_pred(const char *str, void *data,
/* go past the last quote */
i++;
 
-   } else if (isdigit(str[i])) {
+   } else if (isdigit(str[i]) || str[i] == '-') {
 
/* Make sure the field is not a string */
if (is_string_field(field)) {
@@ -1312,6 +1312,9 @@ static int parse_pred(const char *str, void *data,
goto err_free;
}
 
+   if (str[i] == '-')
+   i++;
+
/* We allow 0xDEADBEEF */
while (isalnum(str[i]))
i++;
-- 
2.17.1



Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-11-21 Thread Pavel Tikhomirov

Great news, that it works for you!

Thanks a lot!
Pavel

On 11/22/2017 03:49 AM, Stuart Hayes wrote:

My apologies... yes, your patch also fixes my issue.  I was looking at the two 
new places from which you were calling scsi_eh_wakeup(), and didn't notice that 
you moved the spinlock in scsi_device_unbusy()... moving the spinlock in 
scsi_device_unbusy() also should the issue I'm seeing, given that 
scsi_eh_scmd_add() also uses the spinlock.

I tested your patch on my issue, and it did indeed fix my issue.

So you can add...

Tested-by: Stuart Hayes 

Thanks
Stuart


On 11/21/2017 2:09 AM, Pavel Tikhomirov wrote:

My patch should also fix your issue too, please see explanation in reply to 
your patch. Do your testing show that it doesn't?

Thanks, Pavel.

On 11/21/2017 09:10 AM, Stuart Hayes wrote:

Pavel,

It turns out that the error handler on our systems was not getting woken up for 
a different reason... I submitted a patch earlier today that fixes the issue I 
were seeing (I CCed you on the patch).

Before I got my hands on the failing system and was able to root cause it, I 
was pretty sure that your patch was going to fix our issue, because after I 
examined the code paths, I couldn't find any other reason that the error 
handler would not get woken up.  I tried forcing the bug that your patch fixes 
to occur, by compiling in some mdelay()s in a key place or two in the scsi 
code, but it never failed for me that way.  With my patch, several systems that 
previously failed in 10 minutes or less successfully ran for many days.

Thanks,
Stuart

On 11/9/2017 8:54 AM, Pavel Tikhomirov wrote:

Are there any issues with this patch 
(https://patchwork.kernel.org/patch/9938919/) that Pavel Tikhomirov submitted 
back in September?  I am willing to help if there's anything I can do to help 
get it accepted.


Hi, Stuart, I asked James Bottomley about the patch status offlist and it seems 
that the problem is - patch lacks testing and review. I would highly appreciate 
review from your side and anyone who wants to participate!

And if you can confirm that the patch solves the problem on your environment with no side 
effects please add "Tested-by:" tag also.

Thanks, Pavel

On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote:

We have a problem on several our nodes with scsi EH. Imagine such an
order of execution of two threads:

CPU1 scsi_eh_scmd_add    CPU2 scsi_host_queue_ready
/* shost->host_busy == 1 initialy */

  if (shost->shost_state == SHOST_RECOVERY)
  /* does not get here */
  return 0;

lock(shost->host_lock);
shost->shost_state = SHOST_RECOVERY;

  busy = shost->host_busy++;
  /* host->can_queue == 1 initialy, busy == 1
   * - go to starved label */
  lock(shost->host_lock) /* wait */

shost->host_failed++;
/* shost->host_busy == 2, shost->host_failed == 1 */
call scsi_eh_wakeup(shost) {
  if (host_busy == host_failed) {
  /* does not get here */
  wake_up_process(shost->ehandler)
  }
}
unlock(shost->host_lock)

  /* acquire lock */
  shost->host_busy--;

Finaly we do not wakeup scsi_error_handler and all other commands
coming will hang as we are in never ending recovery state as there
is no one left to wakeup handler.

So scsi disc in these host becomes unresponsive and all bio on node
hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.)

Main idea of the fix is to try to do wake up every time we decrement
host_busy or increment host_failed(the latter is already OK).

Now the very *last* one of busy threads getting host_lock after
decrementing host_busy will see all write operations on host's
shost_state, host_busy and host_failed completed thanks to implied
memory barriers on spin_lock/unlock, so at the time of busy==failed
we will trigger wakeup in at least one thread. (Thats why putting
recovery and failed checks under lock)

Signed-off-by: Pavel Tikhomirov 
---
    drivers/scsi/scsi_lib.c | 21 +
    1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6c99221d60aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev)
    if (starget->can_queue > 0)
    atomic_dec(&starget->target_busy);
    +    spin_lock_irqsave(shost->host_lock, flags);
    if (unlikely(scsi_host_in_recovery(shost) &&
- (shost->host_failed || shost->host_eh_scheduled))) {
-    spin_lock_irqsave(shost->host_lock, flags);
+ (shost->host_failed || shost->host_eh_scheduled)))
    scsi_eh_wakeup(shost);
-    spin_unlock_irqrestore(shost->host_lock, flags);
-    }
+    spin_unlock_

Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-11-21 Thread Pavel Tikhomirov
JFYI these patch is in Virtuozzo7 kernel from September, and we have no 
issues found with it until now by out testing, and initial problem does 
not reproduce for 2.5 months.


Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-11-21 Thread Pavel Tikhomirov
My patch should also fix your issue too, please see explanation in reply 
to your patch. Do your testing show that it doesn't?


Thanks, Pavel.

On 11/21/2017 09:10 AM, Stuart Hayes wrote:

Pavel,

It turns out that the error handler on our systems was not getting woken up for 
a different reason... I submitted a patch earlier today that fixes the issue I 
were seeing (I CCed you on the patch).

Before I got my hands on the failing system and was able to root cause it, I 
was pretty sure that your patch was going to fix our issue, because after I 
examined the code paths, I couldn't find any other reason that the error 
handler would not get woken up.  I tried forcing the bug that your patch fixes 
to occur, by compiling in some mdelay()s in a key place or two in the scsi 
code, but it never failed for me that way.  With my patch, several systems that 
previously failed in 10 minutes or less successfully ran for many days.

Thanks,
Stuart

On 11/9/2017 8:54 AM, Pavel Tikhomirov wrote:

Are there any issues with this patch 
(https://patchwork.kernel.org/patch/9938919/) that Pavel Tikhomirov submitted 
back in September?  I am willing to help if there's anything I can do to help 
get it accepted.


Hi, Stuart, I asked James Bottomley about the patch status offlist and it seems 
that the problem is - patch lacks testing and review. I would highly appreciate 
review from your side and anyone who wants to participate!

And if you can confirm that the patch solves the problem on your environment with no side 
effects please add "Tested-by:" tag also.

Thanks, Pavel

On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote:

We have a problem on several our nodes with scsi EH. Imagine such an
order of execution of two threads:

CPU1 scsi_eh_scmd_add    CPU2 scsi_host_queue_ready
/* shost->host_busy == 1 initialy */

     if (shost->shost_state == SHOST_RECOVERY)
     /* does not get here */
     return 0;

lock(shost->host_lock);
shost->shost_state = SHOST_RECOVERY;

     busy = shost->host_busy++;
     /* host->can_queue == 1 initialy, busy == 1
  * - go to starved label */
     lock(shost->host_lock) /* wait */

shost->host_failed++;
/* shost->host_busy == 2, shost->host_failed == 1 */
call scsi_eh_wakeup(shost) {
 if (host_busy == host_failed) {
     /* does not get here */
     wake_up_process(shost->ehandler)
 }
}
unlock(shost->host_lock)

     /* acquire lock */
     shost->host_busy--;

Finaly we do not wakeup scsi_error_handler and all other commands
coming will hang as we are in never ending recovery state as there
is no one left to wakeup handler.

So scsi disc in these host becomes unresponsive and all bio on node
hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.)

Main idea of the fix is to try to do wake up every time we decrement
host_busy or increment host_failed(the latter is already OK).

Now the very *last* one of busy threads getting host_lock after
decrementing host_busy will see all write operations on host's
shost_state, host_busy and host_failed completed thanks to implied
memory barriers on spin_lock/unlock, so at the time of busy==failed
we will trigger wakeup in at least one thread. (Thats why putting
recovery and failed checks under lock)

Signed-off-by: Pavel Tikhomirov 
---
   drivers/scsi/scsi_lib.c | 21 +
   1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6c99221d60aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev)
   if (starget->can_queue > 0)
   atomic_dec(&starget->target_busy);
   +    spin_lock_irqsave(shost->host_lock, flags);
   if (unlikely(scsi_host_in_recovery(shost) &&
- (shost->host_failed || shost->host_eh_scheduled))) {
-    spin_lock_irqsave(shost->host_lock, flags);
+ (shost->host_failed || shost->host_eh_scheduled)))
   scsi_eh_wakeup(shost);
-    spin_unlock_irqrestore(shost->host_lock, flags);
-    }
+    spin_unlock_irqrestore(shost->host_lock, flags);
     atomic_dec(&sdev->device_busy);
   }
@@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct 
request_queue *q,
   spin_unlock_irq(shost->host_lock);
   out_dec:
   atomic_dec(&shost->host_busy);
+
+    spin_lock_irq(shost->host_lock);
+    if (unlikely(scsi_host_in_recovery(shost) &&
+ (shost->host_failed || shost->host_eh_scheduled)))
+    scsi_eh_wakeup(shost);
+    spin_unlock_irq(shost->host_lock);
+
   return 0;
   }
   @@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct 
blk_mq_hw_ctx *hc

Re: [PATCH] scsi_error: ensure EH wakes up on error to prevent host getting stuck

2017-11-21 Thread Pavel Tikhomirov

On 11/20/2017 10:11 PM, Stuart Hayes wrote:

When a command is added to the host's error handler command queue, there is a 
chance that the error handler will not be woken up.  This can happen when one CPU 
is running scsi_eh_scmd_add() at the same time as another CPU is running 
scsi_device_unbusy() for a different command on the same host.  Each function 
changes one value, and then looks at the value of a variable that the other 
function has just changed, but if they both see stale data, neither will actually 
wake up the error handle >
In scsi_eh_scmd_add, host_failed is incremented, then scsi_eh_wakeup() is 
called, which sees that host_busy is still 2, so it doesn't actually wake up 
the handler.  Meanwhile, in scsi_device_unbusy(), host_busy is decremented, and 
then it sees that host_failed is 0, so it doesn't even call scsi_eh_wakeup().


If in scsi_eh_scmd_add() we call scsi_eh_wakeup() it is done under 
spinlock, so we have implied memory barrier here. All stores which we've 
done before we had released the lock will be seen by any other thread in 
same critical section if the thread took spinlock after us. So later 
scsi_device_unbusy() in it's scsi_eh_wakeup() also sees host_failed==1.


Actually the problem is that in scsi_device_unbusy() the check below:

if (unlikely(scsi_host_in_recovery(shost) &&
 (shost->host_failed || shost->host_eh_scheduled))

is not under same spinlock, so that host_failed can be actually be 0 at 
these point and we never get to scsi_eh_wakeup, which my patch "scsi/eh: 
fix hang adding ehandler wakeups after decrementing host_busy" also 
fixes by putting these check under proper lock and thus the implied 
barrier is added and we don't need actual barrier here.


Please see "LOCK ACQUISITION FUNCTIONS" 
Documentation/memory-barriers.txt for further information about implied 
memory barriers.




Signed-off-by: Stuart Hyaes 

---
diff -pur linux-4.14/drivers/scsi/scsi_error.c 
linux-4.14-stu/drivers/scsi/scsi_error.c
--- linux-4.14/drivers/scsi/scsi_error.c2017-11-12 12:46:13.0 
-0600
+++ linux-4.14-stu/drivers/scsi/scsi_error.c2017-11-17 14:22:19.230867923 
-0600
@@ -243,6 +243,10 @@ void scsi_eh_scmd_add(struct scsi_cmnd *
scsi_eh_reset(scmd);
list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
shost->host_failed++;
+   /*
+* See scsi_device_unbusy() for explanation of smp_mb().
+*/
+   smp_mb();
scsi_eh_wakeup(shost);
spin_unlock_irqrestore(shost->host_lock, flags);
  }
diff -pur linux-4.14/drivers/scsi/scsi_lib.c 
linux-4.14-stu/drivers/scsi/scsi_lib.c
--- linux-4.14/drivers/scsi/scsi_lib.c  2017-11-12 12:46:13.0 -0600
+++ linux-4.14-stu/drivers/scsi/scsi_lib.c  2017-11-17 14:22:15.814867833 
-0600
@@ -325,6 +325,15 @@ void scsi_device_unbusy(struct scsi_devi
unsigned long flags;
  
  	atomic_dec(&shost->host_busy);

+   
+   /* This function changes host_busy and looks at host_failed, while
+* scsi_eh_scmd_add() updates host_failed and looks at host_busy (in
+* scsi_eh_wakeup())... if these happen simultaneously without the smp
+* memory barrier, each can see the old value, such that neither will
+* wake up the error handler, which can cause the host controller to
+* be hung forever.
+*/
+   smp_mb();
if (starget->can_queue > 0)
atomic_dec(&starget->target_busy);
  



---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-11-09 Thread Pavel Tikhomirov
> Are there any issues with this patch 
(https://patchwork.kernel.org/patch/9938919/) that Pavel Tikhomirov 
submitted back in September?  I am willing to help if there's anything I 
can do to help get it accepted.


Hi, Stuart, I asked James Bottomley about the patch status offlist and 
it seems that the problem is - patch lacks testing and review. I would 
highly appreciate review from your side and anyone who wants to participate!


And if you can confirm that the patch solves the problem on your 
environment with no side effects please add "Tested-by:" tag also.


Thanks, Pavel

On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote:

We have a problem on several our nodes with scsi EH. Imagine such an
order of execution of two threads:

CPU1 scsi_eh_scmd_add   CPU2 scsi_host_queue_ready
/* shost->host_busy == 1 initialy */

if (shost->shost_state == SHOST_RECOVERY)
/* does not get here */
return 0;

lock(shost->host_lock);
shost->shost_state = SHOST_RECOVERY;

busy = shost->host_busy++;
/* host->can_queue == 1 initialy, busy == 1
 * - go to starved label */
lock(shost->host_lock) /* wait */

shost->host_failed++;
/* shost->host_busy == 2, shost->host_failed == 1 */
call scsi_eh_wakeup(shost) {
if (host_busy == host_failed) {
/* does not get here */
wake_up_process(shost->ehandler)
}
}
unlock(shost->host_lock)

/* acquire lock */
shost->host_busy--;

Finaly we do not wakeup scsi_error_handler and all other commands
coming will hang as we are in never ending recovery state as there
is no one left to wakeup handler.

So scsi disc in these host becomes unresponsive and all bio on node
hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.)

Main idea of the fix is to try to do wake up every time we decrement
host_busy or increment host_failed(the latter is already OK).

Now the very *last* one of busy threads getting host_lock after
decrementing host_busy will see all write operations on host's
shost_state, host_busy and host_failed completed thanks to implied
memory barriers on spin_lock/unlock, so at the time of busy==failed
we will trigger wakeup in at least one thread. (Thats why putting
recovery and failed checks under lock)

Signed-off-by: Pavel Tikhomirov 
---
  drivers/scsi/scsi_lib.c | 21 +
  1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6c99221d60aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev)
if (starget->can_queue > 0)
atomic_dec(&starget->target_busy);
  
+	spin_lock_irqsave(shost->host_lock, flags);

if (unlikely(scsi_host_in_recovery(shost) &&
-(shost->host_failed || shost->host_eh_scheduled))) {
-   spin_lock_irqsave(shost->host_lock, flags);
+(shost->host_failed || shost->host_eh_scheduled)))
scsi_eh_wakeup(shost);
-   spin_unlock_irqrestore(shost->host_lock, flags);
-   }
+   spin_unlock_irqrestore(shost->host_lock, flags);
  
  	atomic_dec(&sdev->device_busy);

  }
@@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct 
request_queue *q,
spin_unlock_irq(shost->host_lock);
  out_dec:
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
return 0;
  }
  
@@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
  
  out_dec_host_busy:

atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
  out_dec_target_busy:
if (scsi_target(sdev)->can_queue > 0)
atomic_dec(&scsi_target(sdev)->target_busy);



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-10-20 Thread Pavel Tikhomirov

ping

On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote:

We have a problem on several our nodes with scsi EH. Imagine such an
order of execution of two threads:

CPU1 scsi_eh_scmd_add   CPU2 scsi_host_queue_ready
/* shost->host_busy == 1 initialy */

if (shost->shost_state == SHOST_RECOVERY)
/* does not get here */
return 0;

lock(shost->host_lock);
shost->shost_state = SHOST_RECOVERY;

busy = shost->host_busy++;
/* host->can_queue == 1 initialy, busy == 1
 * - go to starved label */
lock(shost->host_lock) /* wait */

shost->host_failed++;
/* shost->host_busy == 2, shost->host_failed == 1 */
call scsi_eh_wakeup(shost) {
if (host_busy == host_failed) {
/* does not get here */
wake_up_process(shost->ehandler)
}
}
unlock(shost->host_lock)

/* acquire lock */
shost->host_busy--;

Finaly we do not wakeup scsi_error_handler and all other commands
coming will hang as we are in never ending recovery state as there
is no one left to wakeup handler.

So scsi disc in these host becomes unresponsive and all bio on node
hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.)

Main idea of the fix is to try to do wake up every time we decrement
host_busy or increment host_failed(the latter is already OK).

Now the very *last* one of busy threads getting host_lock after
decrementing host_busy will see all write operations on host's
shost_state, host_busy and host_failed completed thanks to implied
memory barriers on spin_lock/unlock, so at the time of busy==failed
we will trigger wakeup in at least one thread. (Thats why putting
recovery and failed checks under lock)

Signed-off-by: Pavel Tikhomirov 
---
  drivers/scsi/scsi_lib.c | 21 +
  1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6c99221d60aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev)
if (starget->can_queue > 0)
atomic_dec(&starget->target_busy);
  
+	spin_lock_irqsave(shost->host_lock, flags);

if (unlikely(scsi_host_in_recovery(shost) &&
-(shost->host_failed || shost->host_eh_scheduled))) {
-   spin_lock_irqsave(shost->host_lock, flags);
+(shost->host_failed || shost->host_eh_scheduled)))
scsi_eh_wakeup(shost);
-   spin_unlock_irqrestore(shost->host_lock, flags);
-   }
+   spin_unlock_irqrestore(shost->host_lock, flags);
  
  	atomic_dec(&sdev->device_busy);

  }
@@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct 
request_queue *q,
spin_unlock_irq(shost->host_lock);
  out_dec:
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
return 0;
  }
  
@@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
  
  out_dec_host_busy:

atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
  out_dec_target_busy:
if (scsi_target(sdev)->can_queue > 0)
atomic_dec(&scsi_target(sdev)->target_busy);



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-10-04 Thread Pavel Tikhomirov
Hi. Please tell if there is something I can do to help the patch get 
processed? It is on the list without reply for almost a month.


On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote:

We have a problem on several our nodes with scsi EH. Imagine such an
order of execution of two threads:

CPU1 scsi_eh_scmd_add   CPU2 scsi_host_queue_ready
/* shost->host_busy == 1 initialy */

if (shost->shost_state == SHOST_RECOVERY)
/* does not get here */
return 0;

lock(shost->host_lock);
shost->shost_state = SHOST_RECOVERY;

busy = shost->host_busy++;
/* host->can_queue == 1 initialy, busy == 1
 * - go to starved label */
lock(shost->host_lock) /* wait */

shost->host_failed++;
/* shost->host_busy == 2, shost->host_failed == 1 */
call scsi_eh_wakeup(shost) {
if (host_busy == host_failed) {
/* does not get here */
wake_up_process(shost->ehandler)
}
}
unlock(shost->host_lock)

/* acquire lock */
shost->host_busy--;

Finaly we do not wakeup scsi_error_handler and all other commands
coming will hang as we are in never ending recovery state as there
is no one left to wakeup handler.

So scsi disc in these host becomes unresponsive and all bio on node
hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.)

Main idea of the fix is to try to do wake up every time we decrement
host_busy or increment host_failed(the latter is already OK).

Now the very *last* one of busy threads getting host_lock after
decrementing host_busy will see all write operations on host's
shost_state, host_busy and host_failed completed thanks to implied
memory barriers on spin_lock/unlock, so at the time of busy==failed
we will trigger wakeup in at least one thread. (Thats why putting
recovery and failed checks under lock)

Signed-off-by: Pavel Tikhomirov 
---
  drivers/scsi/scsi_lib.c | 21 +
  1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6c99221d60aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev)
if (starget->can_queue > 0)
atomic_dec(&starget->target_busy);
  
+	spin_lock_irqsave(shost->host_lock, flags);

if (unlikely(scsi_host_in_recovery(shost) &&
-(shost->host_failed || shost->host_eh_scheduled))) {
-   spin_lock_irqsave(shost->host_lock, flags);
+(shost->host_failed || shost->host_eh_scheduled)))
scsi_eh_wakeup(shost);
-   spin_unlock_irqrestore(shost->host_lock, flags);
-   }
+   spin_unlock_irqrestore(shost->host_lock, flags);
  
  	atomic_dec(&sdev->device_busy);

  }
@@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct 
request_queue *q,
spin_unlock_irq(shost->host_lock);
  out_dec:
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
return 0;
  }
  
@@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
  
  out_dec_host_busy:

atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
  out_dec_target_busy:
if (scsi_target(sdev)->can_queue > 0)
atomic_dec(&scsi_target(sdev)->target_busy);



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-09-05 Thread Pavel Tikhomirov
We have a problem on several our nodes with scsi EH. Imagine such an
order of execution of two threads:

CPU1 scsi_eh_scmd_add   CPU2 scsi_host_queue_ready
/* shost->host_busy == 1 initialy */

if (shost->shost_state == SHOST_RECOVERY)
/* does not get here */
return 0;

lock(shost->host_lock);
shost->shost_state = SHOST_RECOVERY;

busy = shost->host_busy++;
/* host->can_queue == 1 initialy, busy == 1
 * - go to starved label */
lock(shost->host_lock) /* wait */

shost->host_failed++;
/* shost->host_busy == 2, shost->host_failed == 1 */
call scsi_eh_wakeup(shost) {
if (host_busy == host_failed) {
/* does not get here */
wake_up_process(shost->ehandler)
}
}
unlock(shost->host_lock)

/* acquire lock */
shost->host_busy--;

Finaly we do not wakeup scsi_error_handler and all other commands
coming will hang as we are in never ending recovery state as there
is no one left to wakeup handler.

So scsi disc in these host becomes unresponsive and all bio on node
hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.)

Main idea of the fix is to try to do wake up every time we decrement
host_busy or increment host_failed(the latter is already OK).

Now the very *last* one of busy threads getting host_lock after
decrementing host_busy will see all write operations on host's
shost_state, host_busy and host_failed completed thanks to implied
memory barriers on spin_lock/unlock, so at the time of busy==failed
we will trigger wakeup in at least one thread. (Thats why putting
recovery and failed checks under lock)

Signed-off-by: Pavel Tikhomirov 
---
 drivers/scsi/scsi_lib.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6c99221d60aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev)
if (starget->can_queue > 0)
atomic_dec(&starget->target_busy);
 
+   spin_lock_irqsave(shost->host_lock, flags);
if (unlikely(scsi_host_in_recovery(shost) &&
-(shost->host_failed || shost->host_eh_scheduled))) {
-   spin_lock_irqsave(shost->host_lock, flags);
+(shost->host_failed || shost->host_eh_scheduled)))
scsi_eh_wakeup(shost);
-   spin_unlock_irqrestore(shost->host_lock, flags);
-   }
+   spin_unlock_irqrestore(shost->host_lock, flags);
 
atomic_dec(&sdev->device_busy);
 }
@@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct 
request_queue *q,
spin_unlock_irq(shost->host_lock);
 out_dec:
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
return 0;
 }
 
@@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx 
*hctx,
 
 out_dec_host_busy:
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
 out_dec_target_busy:
if (scsi_target(sdev)->can_queue > 0)
atomic_dec(&scsi_target(sdev)->target_busy);
-- 
2.13.5



[PATCH v4 2/2] prctl: propagate has_child_subreaper flag to every descendant

2017-01-30 Thread Pavel Tikhomirov
If process forks some children when it has is_child_subreaper
flag enabled they will inherit has_child_subreaper flag - first
group, when is_child_subreaper is disabled forked children will
not inherit it - second group. So child-subreaper does not reparent
all his descendants when their parents die. Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.

To simplify these we can add a propagation of has_child_subreaper
flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child-
subreaper to setup has_child_subreaper flag.

In common cases when process like systemd first sets itself to
be a child-subreaper and only after that forks its services, we will
have zero-length list of descendants to walk. Testing with binary
subtree of 2^15 processes prctl took < 0.007 sec and has shown close
to linear dependency(~0.2 * n * usec) on lower numbers of processes.

Moreover, I doubt someone intentionaly pre-forks the children whitch
should reparent to init before becoming subreaper, because some our
ancestor migh have had is_child_subreaper flag while forking our
sub-tree and our childs will all inherit has_child_subreaper flag,
and we have no way to influence it. And only way to check if we have
no has_child_subreaper flag is to create some childs, kill them and
see where they will reparent to.

Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing
seems to be the same.

Optimize:

a) When descendant already has has_child_subreaper flag all his subtree
has it too already.

* for a) to be true need to move has_child_subreaper inheritance under
the same tasklist_lock with adding task to its ->real_parent->children
as without it process can inherit zero has_child_subreaper, then we
set 1 to it's parent flag, check that parent has no more children, and
only after child with wrong flag is added to the tree.

* Also make these inheritance more clear by using real_parent instead of
current, as on clone(CLONE_PARENT) if current has is_child_subreaper
and real_parent has no is_child_subreaper or has_child_subreaper, child
will have has_child_subreaper flag set without actually having a
subreaper in it's ancestors.

b) When some descendant is child_reaper, it's subtree is in different
pidns from us(original child-subreaper) and processes from other pidns
will never reparent to us.

So we can skip their(a,b) subtree from walk.

v2: switch to walk_process_tree() general helper, move
has_child_subreaper inheritance
v3: remove csr_descendant leftover, change current to real_parent
in has_child_subreaper inheritance
v4: small commit message fix

Signed-off-by: Pavel Tikhomirov 
---
 kernel/fork.c | 10 +++---
 kernel/sys.c  | 22 ++
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 135b7a4..c814e59 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1367,9 +1367,6 @@ static int copy_signal(unsigned long clone_flags, struct 
task_struct *tsk)
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-   sig->has_child_subreaper = current->signal->has_child_subreaper ||
-  current->signal->is_child_subreaper;
-
mutex_init(&sig->cred_guard_mutex);
 
return 0;
@@ -1800,6 +1797,13 @@ static __latent_entropy struct task_struct *copy_process(
 
p->signal->leader_pid = pid;
p->signal->tty = tty_kref_get(current->signal->tty);
+   /*
+* Inherit has_child_subreaper flag under the same
+* tasklist_lock with adding child to the process tree
+* for propagate_has_child_subreaper optimization.
+*/
+   p->signal->has_child_subreaper = 
p->real_parent->signal->has_child_subreaper ||
+
p->real_parent->signal->is_child_subreaper;
list_add_tail(&p->sibling, &p->real_parent->children);
list_add_tail_rcu(&p->tasks, &init_task.tasks);
attach_pid(p, PIDTYPE_PGID);
diff --git a/kernel/sys.c b/kernel/sys.c
index 842914e..0e4d566 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2063,6 +2063,24 @@ static int prctl_get_tid_address(struct task_struct *me, 
int __user **tid_addr)
 }
 #endif
 
+static int propagate_has_child_subreaper(struct task_struct *p, void *data)
+{
+   /*
+* If task has has_child_subreaper - all its decendants
+* already have these flag too and new decendants will
+  

[PATCH 1/2] introduce the walk_process_tree() helper

2017-01-30 Thread Pavel Tikhomirov
From: Oleg Nesterov 

Add the new helper to walk the process tree, the next patch adds a user.
Note that it visits the group leaders only, proc_visitor can do
for_each_thread itself or we can trivially extend walk_process_tree() to
do this.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Pavel Tikhomirov 
---
 include/linux/sched.h |  3 +++
 kernel/fork.c | 32 
 2 files changed, 35 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9e..7f8ab91 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3067,6 +3067,9 @@ extern bool current_is_single_threaded(void);
 #define for_each_process_thread(p, t)  \
for_each_process(p) for_each_thread(p, t)
 
+typedef int (*proc_visitor)(struct task_struct *p, void *data);
+void walk_process_tree(struct task_struct *top, proc_visitor, void *);
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
return tsk->signal->nr_threads;
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8a..135b7a4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2053,6 +2053,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, 
unsigned long, newsp,
 }
 #endif
 
+void walk_process_tree(struct task_struct *top, proc_visitor visitor, void 
*data)
+{
+   struct task_struct *leader, *parent, *child;
+   int res;
+
+   read_lock(&tasklist_lock);
+   leader = top = top->group_leader;
+down:
+   for_each_thread(leader, parent) {
+   list_for_each_entry(child, &parent->children, sibling) {
+   res = visitor(child, data);
+   if (res) {
+   if (res < 0)
+   goto out;
+   leader = child;
+   goto down;
+   }
+up:
+   ;
+   }
+   }
+
+   if (leader != top) {
+   child = leader;
+   parent = child->real_parent;
+   leader = parent->group_leader;
+   goto up;
+   }
+out:
+   read_unlock(&tasklist_lock);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
2.9.3



[PATCH v3 2/2] prctl: propagate has_child_subreaper flag to every descendant

2017-01-30 Thread Pavel Tikhomirov
If process forks some children when it has is_child_subreaper
flag enabled they will inherit has_child_subreaper flag - first
group, when is_child_subreaper is disabled forked children will
not inherit it - second group. So child-subreaper does not reparent
all his descendants when their parents die. Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.

To simplify these we can add a propagation of has_child_subreaper
flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child-
subreaper to setup has_child_subreaper flag.

In common cases when process like systemd first sets itself to
be a child-subreaper and only after that forks its services, we will
have zero-length list of descendants to walk. Testing with binary
subtree of 2^15 processes prctl took < 0.007 sec and has shown close
to linear dependency(~0.2 * n * usec) on lower numbers of processes.

Moreover, I doubt someone intentionaly pre-forks the children whitch
should reparent to previous reaper(e.g. init) before becoming
subreaper, because some our ancestor migh have had is_child_subreaper
flag while forking our sub-tree and our childs will all inherit
has_child_subreaper flag, and we have no way to influence it. And only
way to check if we have no has_child_subreaper flag is to create some
childs, kill them and see where they will reparent to.

Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing
seems to be the same.

Optimize:

a) When descendant already has has_child_subreaper flag all his subtree
has it too already.

* for a) to be true need to move has_child_subreaper inheritance under
the same tasklist_lock with adding task to its ->real_parent->children
as without it process can inherit zero has_child_subreaper, then we
set 1 to it's parent flag, check that it has no more parents, and only
after child with wrong flag is added to the tree.

* also make these inheritance more clear by using real_parent instead of
current, as on clone(CLONE_PARENT) if current has is_child_subreaper
and real_parent has no is_child_subreaper or has_child_subreaper, child
will have has_child_subreaper flag set without actually having a
subreaper in it's ancestors.

b) When some descendant is child_reaper, it's subtree is in different
pidns from us(original child-subreaper) and processes from other pidns
will never reparent to us.

So we can skip their(a,b) subtree from walk.

v2: switch to walk_process_tree() general helper, move
has_child_subreaper inheritance
v3: remove csr_descendant leftover, change current to real_parent
in has_child_subreaper inheritance

Signed-off-by: Pavel Tikhomirov 
---
 kernel/fork.c | 10 +++---
 kernel/sys.c  | 22 ++
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 135b7a4..c814e59 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1367,9 +1367,6 @@ static int copy_signal(unsigned long clone_flags, struct 
task_struct *tsk)
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-   sig->has_child_subreaper = current->signal->has_child_subreaper ||
-  current->signal->is_child_subreaper;
-
mutex_init(&sig->cred_guard_mutex);
 
return 0;
@@ -1800,6 +1797,13 @@ static __latent_entropy struct task_struct *copy_process(
 
p->signal->leader_pid = pid;
p->signal->tty = tty_kref_get(current->signal->tty);
+   /*
+* Inherit has_child_subreaper flag under the same
+* tasklist_lock with adding child to the process tree
+* for propagate_has_child_subreaper optimization.
+*/
+   p->signal->has_child_subreaper = 
p->real_parent->signal->has_child_subreaper ||
+
p->real_parent->signal->is_child_subreaper;
list_add_tail(&p->sibling, &p->real_parent->children);
list_add_tail_rcu(&p->tasks, &init_task.tasks);
attach_pid(p, PIDTYPE_PGID);
diff --git a/kernel/sys.c b/kernel/sys.c
index 842914e..0e4d566 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2063,6 +2063,24 @@ static int prctl_get_tid_address(struct task_struct *me, 
int __user **tid_addr)
 }
 #endif
 
+static int propagate_has_child_subreaper(struct task_struct *p, void *data)
+{
+   /*
+* If task has has_child_subreaper - all its decendants
+* already have these flag too and new decendants will
+* inher

[PATCH 1/2] introduce the walk_process_tree() helper

2017-01-30 Thread Pavel Tikhomirov
From: Oleg Nesterov 

Add the new helper to walk the process tree, the next patch adds a user.
Note that it visits the group leaders only, proc_visitor can do
for_each_thread itself or we can trivially extend walk_process_tree() to
do this.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Pavel Tikhomirov 
---
 include/linux/sched.h |  3 +++
 kernel/fork.c | 32 
 2 files changed, 35 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9e..7f8ab91 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3067,6 +3067,9 @@ extern bool current_is_single_threaded(void);
 #define for_each_process_thread(p, t)  \
for_each_process(p) for_each_thread(p, t)
 
+typedef int (*proc_visitor)(struct task_struct *p, void *data);
+void walk_process_tree(struct task_struct *top, proc_visitor, void *);
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
return tsk->signal->nr_threads;
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8a..135b7a4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2053,6 +2053,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, 
unsigned long, newsp,
 }
 #endif
 
+void walk_process_tree(struct task_struct *top, proc_visitor visitor, void 
*data)
+{
+   struct task_struct *leader, *parent, *child;
+   int res;
+
+   read_lock(&tasklist_lock);
+   leader = top = top->group_leader;
+down:
+   for_each_thread(leader, parent) {
+   list_for_each_entry(child, &parent->children, sibling) {
+   res = visitor(child, data);
+   if (res) {
+   if (res < 0)
+   goto out;
+   leader = child;
+   goto down;
+   }
+up:
+   ;
+   }
+   }
+
+   if (leader != top) {
+   child = leader;
+   parent = child->real_parent;
+   leader = parent->group_leader;
+   goto up;
+   }
+out:
+   read_unlock(&tasklist_lock);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
2.9.3



[PATCH v4 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic

2017-01-30 Thread Pavel Tikhomirov
Oleg Nesterov (1):
  introduce the walk_process_tree() helper

Pavel Tikhomirov (1):
  prctl: propagate has_child_subreaper flag to every descendant

 include/linux/sched.h |  3 +++
 kernel/fork.c | 42 +++---
 kernel/sys.c  | 22 ++
 3 files changed, 64 insertions(+), 3 deletions(-)

-- 
2.9.3



[PATCH v3 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic

2017-01-30 Thread Pavel Tikhomirov
Oleg Nesterov (1):
  introduce the walk_process_tree() helper

Pavel Tikhomirov (1):
  prctl: propagate has_child_subreaper flag to every descendant

 include/linux/sched.h |  3 +++
 kernel/fork.c | 42 +++---
 kernel/sys.c  | 22 ++
 3 files changed, 64 insertions(+), 3 deletions(-)

-- 
2.9.3



Re: [PATCH v3 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic

2017-01-30 Thread Pavel Tikhomirov

please drop it, errors in commit message

On 01/30/2017 05:48 PM, Pavel Tikhomirov wrote:

Oleg Nesterov (1):
  introduce the walk_process_tree() helper

Pavel Tikhomirov (1):
  prctl: propagate has_child_subreaper flag to every descendant

 include/linux/sched.h |  3 +++
 kernel/fork.c | 42 +++---
 kernel/sys.c  | 22 ++
 3 files changed, 64 insertions(+), 3 deletions(-)



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH v2 2/2] prctl: propagate has_child_subreaper flag to every descendant

2017-01-30 Thread Pavel Tikhomirov



On 01/30/2017 03:51 PM, Oleg Nesterov wrote:

On 01/27, Pavel Tikhomirov wrote:


--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1725,6 +1725,8 @@ struct task_struct {
struct signal_struct *signal;
struct sighand_struct *sighand;

+   struct list_head csr_descendant;
+


You forgot to remove this part ;)


Oh, sure.




--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1367,9 +1367,6 @@ static int copy_signal(unsigned long clone_flags, struct 
task_struct *tsk)
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;

-   sig->has_child_subreaper = current->signal->has_child_subreaper ||
-  current->signal->is_child_subreaper;
-
mutex_init(&sig->cred_guard_mutex);

return 0;
@@ -1800,6 +1797,13 @@ static __latent_entropy struct task_struct *copy_process(

p->signal->leader_pid = pid;
p->signal->tty = tty_kref_get(current->signal->tty);
+   /*
+* Inherit has_child_subreaper flag under the same
+* tasklist_lock with adding child to the process tree
+* for propagate_has_child_subreaper optimization.
+*/
+   p->signal->has_child_subreaper = 
current->signal->has_child_subreaper ||
+
current->signal->is_child_subreaper;


Ah yes, we need this change too...

And perhaps it would be more correct to do

p->signal->has_child_subreaper =
p->real_parent->signal->has_child_subreaper ||
p->real_parent->signal->is_child_subreaper;

the current code is not exactly right if CLONE_PARENT...


I'm fine with inheriting 'has' flag from real_parent, because if 
real_parent does not have 'has' flag set but parent has 'has' set, we 
inherited the flag in vain.


But I don't actually think that inheritance from parent not real_parent 
breaks my optimization: if real_parent has the flag, so does the parent.




Oleg.



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] prctl.2: Document new PR_SET_CHILD_SUBREAPER semantics

2017-01-27 Thread Pavel Tikhomirov

On 01/28/2017 01:47 AM, Michael Kerrisk (man-pages) wrote:

Hello Pavel,

On 27 January 2017 at 23:11, Pavel Tikhomirov  wrote:

old semantics was non deterministic and worked differently
depending on the external factors, but nothing changes if
process first sets itself subreaper and only after forks


When did the kernel behavior change?


Sorry for inconvenience, should have added you to cc of a main patch also.

Kernel behavior haven't changed yet, doc change will be a continuation 
of "[PATCH v2 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic", 
see https://lkml.org/lkml/2017/1/27/108




Cheers,

Michael



Signed-off-by: Pavel Tikhomirov 
---
 man2/prctl.2 | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/man2/prctl.2 b/man2/prctl.2
index 97cf21a..84fbd7e 100644
--- a/man2/prctl.2
+++ b/man2/prctl.2
@@ -162,20 +162,30 @@ if
 is zero, unset the attribute.

 When a process is marked as a child subreaper,
-all of the children that it creates, and their descendants,
+all of the children that it creates or have created already, and their 
descendants,
 will be marked as having a subreaper.
 In effect, a subreaper fulfills the role of
 .BR init (1)
 for its descendant processes.
-Upon termination of a process
-that is orphaned (i.e., its immediate parent has already terminated)
-and marked as having a subreaper,
-the nearest still living ancestor subreaper
-will receive a
+Upon termination of a process having a subreaper,
+all its children become orphaned
+and will be reparented to the nearest still living ancestor subreaper.
+So that on it's adopted child termination
+these subreaper will receive a
 .BR SIGCHLD
 signal and will be able to
 .BR wait (2)
-on the process to discover its termination status.
+on the child to discover its termination status.
+
+Note, that on older kernels these prctl works slightly different.
+Child subreaper process was not actualy the
+.BR init (1)
+for all its descendants.
+If process forks a child while not been a child subreaper,
+and after sets himself child subreaper,
+sub-tree of the child might or might not reparent to the subreaper,
+depending on the configuration of ancestors of the subreaper,
+at the time of forking our subtree.
 .TP
 .BR PR_GET_CHILD_SUBREAPER " (since Linux 3.4)"
 Return the "child subreaper" setting of the caller,
--
2.9.3







--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH] prctl.2: Document new PR_SET_CHILD_SUBREAPER semantics

2017-01-27 Thread Pavel Tikhomirov
old semantics was non deterministic and worked differently
depending on the external factors, but nothing changes if
process first sets itself subreaper and only after forks

Signed-off-by: Pavel Tikhomirov 
---
 man2/prctl.2 | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/man2/prctl.2 b/man2/prctl.2
index 97cf21a..84fbd7e 100644
--- a/man2/prctl.2
+++ b/man2/prctl.2
@@ -162,20 +162,30 @@ if
 is zero, unset the attribute.
 
 When a process is marked as a child subreaper,
-all of the children that it creates, and their descendants,
+all of the children that it creates or have created already, and their 
descendants,
 will be marked as having a subreaper.
 In effect, a subreaper fulfills the role of
 .BR init (1)
 for its descendant processes.
-Upon termination of a process
-that is orphaned (i.e., its immediate parent has already terminated)
-and marked as having a subreaper,
-the nearest still living ancestor subreaper
-will receive a
+Upon termination of a process having a subreaper,
+all its children become orphaned
+and will be reparented to the nearest still living ancestor subreaper.
+So that on it's adopted child termination
+these subreaper will receive a
 .BR SIGCHLD
 signal and will be able to
 .BR wait (2)
-on the process to discover its termination status.
+on the child to discover its termination status.
+
+Note, that on older kernels these prctl works slightly different.
+Child subreaper process was not actualy the
+.BR init (1)
+for all its descendants.
+If process forks a child while not been a child subreaper,
+and after sets himself child subreaper,
+sub-tree of the child might or might not reparent to the subreaper,
+depending on the configuration of ancestors of the subreaper,
+at the time of forking our subtree.
 .TP
 .BR PR_GET_CHILD_SUBREAPER " (since Linux 3.4)"
 Return the "child subreaper" setting of the caller,
-- 
2.9.3



[PATCH v2 2/2] prctl: propagate has_child_subreaper flag to every descendant

2017-01-27 Thread Pavel Tikhomirov
If process forks some children when it has is_child_subreaper
flag enabled they will inherit has_child_subreaper flag - first
group, when is_child_subreaper is disabled forked children will
not inherit it - second group. So child-subreaper does not reparent
all his descendants when their parents die. Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.

To simplify these we can add a propagation of has_child_subreaper
flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child-
subreaper to setup has_child_subreaper flag.

In common cases when process like systemd first sets itself to
be a child-subreaper and only after that forks its services, we will
have zero-length list of descendants to walk. Testing with binary
subtree of 2^15 processes prctl took < 0.007 sec and has shown close
to linear dependency(~0.2 * n * usec) on lower numbers of processes.

Moreover, I doubt someone intentionaly pre-forks the children whitch
should reparent to init before becoming subreaper, because some our
ancestor migh have had is_child_subreaper flag while forking our
sub-tree and our childs will all inherit has_child_subreaper flag,
and we have no way to influence it. And only way to check if we have
no has_child_subreaper flag is to create some childs, kill them
and see where they will reparent to.

Use walk_process_tree helper to walk subtree, thanks to Oleg! Timing
seems to be the same.

Optimize:

a) When descendant already has has_child_subreaper flag all his subtree
has it too already.

* for a) to be true need to move has_child_subreaper inheritance under
the same tasklist_lock with adding task to its ->real_parent->children
as without it process can inherit zero has_child_subreaper, then we
set 1 to it's parent flag, check that parent has no more children, and
only after child with wrong flag is added to the tree.

b) When some descendant is child_reaper, it's subtree is in different
pidns from us(original child-subreaper) and processes from other pidns
will never reparent to us.

So we can skip their(a,b) subtree from walk.

v2: switch to walk_process_tree() general helper, move
has_child_subreaper inheritance

Signed-off-by: Pavel Tikhomirov 
---
 include/linux/sched.h |  2 ++
 kernel/fork.c | 10 +++---
 kernel/sys.c  | 22 ++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7f8ab91..a08f006 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1725,6 +1725,8 @@ struct task_struct {
struct signal_struct *signal;
struct sighand_struct *sighand;
 
+   struct list_head csr_descendant;
+
sigset_t blocked, real_blocked;
sigset_t saved_sigmask; /* restored if set_restore_sigmask() was used */
struct sigpending pending;
diff --git a/kernel/fork.c b/kernel/fork.c
index 135b7a4..5874d01 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1367,9 +1367,6 @@ static int copy_signal(unsigned long clone_flags, struct 
task_struct *tsk)
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-   sig->has_child_subreaper = current->signal->has_child_subreaper ||
-  current->signal->is_child_subreaper;
-
mutex_init(&sig->cred_guard_mutex);
 
return 0;
@@ -1800,6 +1797,13 @@ static __latent_entropy struct task_struct *copy_process(
 
p->signal->leader_pid = pid;
p->signal->tty = tty_kref_get(current->signal->tty);
+   /*
+* Inherit has_child_subreaper flag under the same
+* tasklist_lock with adding child to the process tree
+* for propagate_has_child_subreaper optimization.
+*/
+   p->signal->has_child_subreaper = 
current->signal->has_child_subreaper ||
+
current->signal->is_child_subreaper;
list_add_tail(&p->sibling, &p->real_parent->children);
list_add_tail_rcu(&p->tasks, &init_task.tasks);
attach_pid(p, PIDTYPE_PGID);
diff --git a/kernel/sys.c b/kernel/sys.c
index 842914e..0e4d566 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2063,6 +2063,24 @@ static int prctl_get_tid_address(struct task_struct *me, 
int __user **tid_addr)
 }
 #endif
 
+static int propagate_has_child_subreaper(struct task_struct *p, void *data)
+{
+   /*
+* If task has has_child_subreaper - all its decendants
+* already

[PATCH 1/2] introduce the walk_process_tree() helper

2017-01-27 Thread Pavel Tikhomirov
From: Oleg Nesterov 

Add the new helper to walk the process tree, the next patch adds a user.
Note that it visits the group leaders only, proc_visitor can do
for_each_thread itself or we can trivially extend walk_process_tree() to
do this.

Signed-off-by: Oleg Nesterov 
Reviewed-by: Pavel Tikhomirov 
---
 include/linux/sched.h |  3 +++
 kernel/fork.c | 32 
 2 files changed, 35 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9e..7f8ab91 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3067,6 +3067,9 @@ extern bool current_is_single_threaded(void);
 #define for_each_process_thread(p, t)  \
for_each_process(p) for_each_thread(p, t)
 
+typedef int (*proc_visitor)(struct task_struct *p, void *data);
+void walk_process_tree(struct task_struct *top, proc_visitor, void *);
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
return tsk->signal->nr_threads;
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8a..135b7a4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2053,6 +2053,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, 
unsigned long, newsp,
 }
 #endif
 
+void walk_process_tree(struct task_struct *top, proc_visitor visitor, void 
*data)
+{
+   struct task_struct *leader, *parent, *child;
+   int res;
+
+   read_lock(&tasklist_lock);
+   leader = top = top->group_leader;
+down:
+   for_each_thread(leader, parent) {
+   list_for_each_entry(child, &parent->children, sibling) {
+   res = visitor(child, data);
+   if (res) {
+   if (res < 0)
+   goto out;
+   leader = child;
+   goto down;
+   }
+up:
+   ;
+   }
+   }
+
+   if (leader != top) {
+   child = leader;
+   parent = child->real_parent;
+   leader = parent->group_leader;
+   goto up;
+   }
+out:
+   read_unlock(&tasklist_lock);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
2.9.3



[PATCH v2 0/2] prctl: make PR_SET_CHILD_SUBREAPER deterministic

2017-01-27 Thread Pavel Tikhomirov
will send documentation change proposal in reply to these letter

Oleg Nesterov (1):
  introduce the walk_process_tree() helper

Pavel Tikhomirov (1):
  prctl: propagate has_child_subreaper flag to every descendant

 include/linux/sched.h |  5 +
 kernel/fork.c | 42 +++---
 kernel/sys.c  | 22 ++
 3 files changed, 66 insertions(+), 3 deletions(-)

-- 
2.9.3



Re: [PATCH] introduce the walk_process_tree() helper

2017-01-24 Thread Pavel Tikhomirov


Will include it in patch-set with documentation fix. Thanks Oleg!

Reviewed-by: Pavel Tikhomirov 

On 01/23/2017 02:57 PM, Oleg Nesterov wrote:

Add the new helper to walk the process tree, the next patch adds a user.
Note that it visits the group leaders only, proc_visitor can do
for_each_thread itself or we can trivially extend walk_process_tree() to
do this.

Signed-off-by: Oleg Nesterov 
---
 include/linux/sched.h |  3 +++
 kernel/fork.c | 32 
 2 files changed, 35 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9c009d..681f748 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3015,6 +3015,9 @@ extern bool current_is_single_threaded(void);
 #define for_each_process_thread(p, t)  \
for_each_process(p) for_each_thread(p, t)

+typedef int (*proc_visitor)(struct task_struct *p, void *data);
+void walk_process_tree(struct task_struct *top, proc_visitor, void *);
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
return tsk->signal->nr_threads;
diff --git a/kernel/fork.c b/kernel/fork.c
index 663c6a7..2522bae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2042,6 +2042,38 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, 
unsigned long, newsp,
 }
 #endif

+void walk_process_tree(struct task_struct *top, proc_visitor visitor, void 
*data)
+{
+   struct task_struct *leader, *parent, *child;
+   int res;
+
+   read_lock(&tasklist_lock);
+   leader = top = top->group_leader;
+down:
+   for_each_thread(leader, parent) {
+   list_for_each_entry(child, &parent->children, sibling) {
+   res = visitor(child, data);
+   if (res) {
+   if (res < 0)
+   goto out;
+   leader = child;
+   goto down;
+   }
+up:
+   ;
+   }
+   }
+
+   if (leader != top) {
+   child = leader;
+   parent = child->real_parent;
+   leader = parent->group_leader;
+   goto up;
+   }
+out:
+   read_unlock(&tasklist_lock);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant

2017-01-23 Thread Pavel Tikhomirov

Add to cc Lennart Poettering 

On 01/23/2017 02:55 PM, Oleg Nesterov wrote:

On 01/22, Pavel Tikhomirov wrote:




Hmm. could you explain how this change helps CRIU? I mean, why
restorer can't do prctl(CHILD_SUBREAPER) before the first fork?


Imagine we have these tree in pidns:

1: has_child_subreaper == 0 && is_child_subreaper == 0
|-2: has_child_subreaper == 0 && is_child_subreaper == 1
| |-3: has_child_subreaper == 0 && is_child_subreaper == 0
| | |-5: has_child_subreaper == 0 && is_child_subreaper == 0
| |-4: has_child_subreaper == 1 && is_child_subreaper == 0
| | |-6: has_child_subreaper == 1 && is_child_subreaper == 0

before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1.
after c/r: (where restorer had is_child_subreaper == 1, everybody in the
tree will have has_child_subreaper == 1) Everybody will reparent to 2.


This is clear, but this can only happen if 2 forks 3 and after that
sets is_child_subreaper, right?

And if someone actually does this then your patch can break this
application, no?

IOW. Currently CRIU can't restore the process tree with the same
has_child_subreaper bits if some process forks before
prctl(PR_SET_CHILD_SUBREAPER). It restores the tree as if prctl()
was called before the 1st fork.

So you change the semantics of PR_SET_CHILD_SUBREAPER and now CRIU
is fine simply because you remove this feature: the sub-reaper can
no longer pre-fork the children which should reparent to the previous
reaper.

I won't really argure, but I am not sure this is good idea...


If one task uses these feature now it must be very carefull: if some our 
ancestor have enabled is_child_subreaper somewhere up the tree, forked 
our tree and after that disabled is_child_subreaper, so we already have 
has-flag and all children will inherit has-flag irrelevant to what is 
our order of fork/prctl-ing to become subreaper.


And only one way to check if the task has no has_child_subreaper flag is 
to create some childs kill them and see to where they will reparent, but 
I doubt that someone is doing these now.



At least I think this should be clearly documented.


Yes, I surely need to add some documentation here. Thanks for mentioning 
that! - Will do.





You don't need this new member and descendants_lock. task_struct has
the ->real_parent pointer so you can work the tree without recursion.


Sorry I don't get how I can walk down the tree of all descendants with help
of ->real_parent pointer, can you please point on some example or explain a
bit more? (I see task_is_descendant() in security/yama/yama_lsm.c but we
will need to check it for every process, not only descendants, the latter
can be a lot faster.)


I'll send a patch, probably a generic helper makes sense.

Btw task_is_descendant() looks wrong at first glance.

Oleg.



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant

2017-01-22 Thread Pavel Tikhomirov

Sorry I had some problem with mail-agent, resend to be on the safe side.

On 01/20/2017 09:14 PM, Oleg Nesterov wrote:

On 01/19, Pavel Tikhomirov wrote:


Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.


Hmm. could you explain how this change helps CRIU? I mean, why
restorer can't do prctl(CHILD_SUBREAPER) before the first fork?


Imagine we have these tree in pidns:

1: has_child_subreaper == 0 && is_child_subreaper == 0
|-2: has_child_subreaper == 0 && is_child_subreaper == 1
| |-3: has_child_subreaper == 0 && is_child_subreaper == 0
| | |-5: has_child_subreaper == 0 && is_child_subreaper == 0
| |-4: has_child_subreaper == 1 && is_child_subreaper == 0
| | |-6: has_child_subreaper == 1 && is_child_subreaper == 0

before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1.
after c/r: (where restorer had is_child_subreaper == 1, everybody in the 
tree will have has_child_subreaper == 1) Everybody will reparent to 2.




Anyway, afaics the patch is sub-optimal and not correct...


--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1715,6 +1715,8 @@ struct task_struct {
struct signal_struct *signal;
struct sighand_struct *sighand;

+   struct list_head csr_descendant;
+


You don't need this new member and descendants_lock. task_struct has
the ->real_parent pointer so you can work the tree without recursion.


Sorry I don't get how I can walk down the tree of all descendants with 
help of ->real_parent pointer, can you please point on some example or 
explain a bit more? (I see task_is_descendant() in 
security/yama/yama_lsm.c but we will need to check it for every process, 
not only descendants, the latter can be a lot faster.)





+static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2)
+{
+   LIST_HEAD(descendants);
+
+   reaper->signal->is_child_subreaper = arg2;
+   if (!arg2)
+   return;
+
+   spin_lock(&descendants_lock);
+   read_lock(&tasklist_lock);
+
+   list_add(&reaper->csr_descendant, &descendants);
+
+   while (!list_empty(&descendants)) {
+   struct task_struct *tsk;
+   struct task_struct *p;
+
+   tsk = list_first_entry(&descendants, struct task_struct,
+   csr_descendant);
+
+   list_for_each_entry(p, &tsk->children, sibling) {


This is not enough. Every thread has its own ->children list, you need
to walk the sub-threads as well.


Will do.




+* If we've found child_reaper - skip descendants in
+* it's subtree as they will never get out pidns
+*/
+   if (is_child_reaper(task_pid(p)))
+   continue;


Again, a child reaper can be multi-threaded, this check can be false
negative.

Probably is_child_reaper() should be renamed somehow and a new helper
makes sense... something like


Will do.



bool task_is_child_reaper(struct task_struct *p)
{
return same_thread_group(p, 
task_active_pid_ns(p)->child_reaper);
}

Oleg.



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH] prctl: propagate has_child_subreaper flag to every descendant

2017-01-22 Thread Pavel Tikhomirov



On 01/20/2017 09:14 PM, Oleg Nesterov wrote:

On 01/19, Pavel Tikhomirov wrote:


Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.


Hmm. could you explain how this change helps CRIU? I mean, why
restorer can't do prctl(CHILD_SUBREAPER) before the first fork?


Imagine we have these tree in pidns:

1: has_child_subreaper == 0 && is_child_subreaper == 0
|-2: has_child_subreaper == 0 && is_child_subreaper == 1
| |-3: has_child_subreaper == 0 && is_child_subreaper == 0
| | |-5: has_child_subreaper == 0 && is_child_subreaper == 0
| |-4: has_child_subreaper == 1 && is_child_subreaper == 0
| | |-6: has_child_subreaper == 1 && is_child_subreaper == 0

before c/r: If 4 dies 6 will reparent to 2, if 3 dies 5 will reparent to 1.
after c/r: (where restorer had is_child_subreaper == 1, everybody in the 
tree will have has_child_subreaper == 1) Everybody will reparent to 2.




Anyway, afaics the patch is sub-optimal and not correct...


--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1715,6 +1715,8 @@ struct task_struct {
struct signal_struct *signal;
struct sighand_struct *sighand;

+   struct list_head csr_descendant;
+


You don't need this new member and descendants_lock. task_struct has
the ->real_parent pointer so you can work the tree without recursion.


Sorry I don't get how I can walk down the tree of all descendants with 
help of ->real_parent pointer, can you please point on some example or 
explain a bit more? (I see task_is_descendant() in 
security/yama/yama_lsm.c but we will need to check it for every process, 
not only descendants, the latter can be a lot faster.)





+static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2)
+{
+   LIST_HEAD(descendants);
+
+   reaper->signal->is_child_subreaper = arg2;
+   if (!arg2)
+   return;
+
+   spin_lock(&descendants_lock);
+   read_lock(&tasklist_lock);
+
+   list_add(&reaper->csr_descendant, &descendants);
+
+   while (!list_empty(&descendants)) {
+   struct task_struct *tsk;
+   struct task_struct *p;
+
+   tsk = list_first_entry(&descendants, struct task_struct,
+   csr_descendant);
+
+   list_for_each_entry(p, &tsk->children, sibling) {


This is not enough. Every thread has its own ->children list, you need
to walk the sub-threads as well.


Will do.




+* If we've found child_reaper - skip descendants in
+* it's subtree as they will never get out pidns
+*/
+   if (is_child_reaper(task_pid(p)))
+   continue;


Again, a child reaper can be multi-threaded, this check can be false
negative.

Probably is_child_reaper() should be renamed somehow and a new helper
makes sense... something like


Will do.



bool task_is_child_reaper(struct task_struct *p)
{
return same_thread_group(p, 
task_active_pid_ns(p)->child_reaper);
}

Oleg.



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH] prctl: propagate has_child_subreaper flag to every descendant

2017-01-19 Thread Pavel Tikhomirov
If process forks some children when it has is_child_subreaper
flag enabled they will inherit has_child_subreaper flag - first
group, when is_child_subreaper is disabled forked children will
not inherit it - second group. So child-subreaper does not reparent
all his descendants when their parents die. Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.

To simplify these we can add a propagation of has_child_subreaper
flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child-
subreaper to setup has_child_subreaper flag.

In common cases when process like systemd first sets itself to
be a child-subreaper and only after that forks its services, we will
have zero-length list of descendants to walk. Testing with binary
subtree of 2^15 processes prctl took < 0.007 sec and has shown close
to linear dependency(~0.2 * n * usec) on lower numbers of processes.

Using csr_descendant list to collect descendants and do tree walk
without recursion.

Optimize:

a) When descendant already has has_child_subreaper flag all his subtree
has it too already.

b) When some descendant is child_reaper, it's subtree is in different
pidns from us(original child-subreaper) and processes from other pidns
will never reparent to us.

So we can skip their(a,b) subtree from walk.

Signed-off-by: Pavel Tikhomirov 
---
 include/linux/sched.h |  2 ++
 kernel/sys.c  | 51 ++-
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d19052..9cb44c4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1715,6 +1715,8 @@ struct task_struct {
struct signal_struct *signal;
struct sighand_struct *sighand;
 
+   struct list_head csr_descendant;
+
sigset_t blocked, real_blocked;
sigset_t saved_sigmask; /* restored if set_restore_sigmask() was used */
struct sigpending pending;
diff --git a/kernel/sys.c b/kernel/sys.c
index 842914e..05b6d7d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2063,6 +2063,55 @@ static int prctl_get_tid_address(struct task_struct *me, 
int __user **tid_addr)
 }
 #endif
 
+static DEFINE_SPINLOCK(descendants_lock);
+
+static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2)
+{
+   LIST_HEAD(descendants);
+
+   reaper->signal->is_child_subreaper = arg2;
+   if (!arg2)
+   return;
+
+   spin_lock(&descendants_lock);
+   read_lock(&tasklist_lock);
+
+   list_add(&reaper->csr_descendant, &descendants);
+
+   while (!list_empty(&descendants)) {
+   struct task_struct *tsk;
+   struct task_struct *p;
+
+   tsk = list_first_entry(&descendants, struct task_struct,
+   csr_descendant);
+
+   list_for_each_entry(p, &tsk->children, sibling) {
+   /*
+* If tsk has has_child_subreaper - all its decendants
+* already have these flag too and new decendants will
+* inherit it on fork, so nothing to be done here.
+*/
+   if (p->signal->has_child_subreaper)
+   continue;
+
+   /*
+* If we've found child_reaper - skip descendants in
+* it's subtree as they will never get out pidns
+*/
+   if (is_child_reaper(task_pid(p)))
+   continue;
+
+   p->signal->has_child_subreaper = 1;
+   list_add(&p->csr_descendant, &descendants);
+   }
+
+   list_del_init(&tsk->csr_descendant);
+   }
+
+   read_unlock(&tasklist_lock);
+   spin_unlock(&descendants_lock);
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
 {
@@ -2213,7 +2262,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, 
unsigned long, arg3,
error = prctl_get_tid_address(me, (int __user **)arg2);
break;
case PR_SET_CHILD_SUBREAPER:
-   me->signal->is_child_subreaper = !!arg2;
+   prctl_set_child_subreaper(me, !!arg2);
break;
case PR_GET_CHILD_SUBREAPER:
error = put_user(me->signal->is_child_subreaper,
-- 
2.9.3



[PATCH v2] ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int

2017-01-08 Thread Pavel Tikhomirov
> cat /proc/sys/net/ipv4/tcp_notsent_lowat
-1
> echo 4294967295 > /proc/sys/net/ipv4/tcp_notsent_lowat
-bash: echo: write error: Invalid argument
> echo -2147483648 > /proc/sys/net/ipv4/tcp_notsent_lowat
> cat /proc/sys/net/ipv4/tcp_notsent_lowat
-2147483648

but in documentation we have "tcp_notsent_lowat - UNSIGNED INTEGER"

v2: simplify to just proc_douintvec
Signed-off-by: Pavel Tikhomirov 
---
 net/ipv4/sysctl_net_ipv4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 80bc36b..566cfc5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -958,7 +958,7 @@ static struct ctl_table ipv4_net_table[] = {
.data   = &init_net.ipv4.sysctl_tcp_notsent_lowat,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_douintvec,
},
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
{
-- 
2.9.3



[PATCH] ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int

2016-12-29 Thread Pavel Tikhomirov
> cat /proc/sys/net/ipv4/tcp_notsent_lowat
-1
> echo 4294967295 > /proc/sys/net/ipv4/tcp_notsent_lowat
-bash: echo: write error: Invalid argument
> echo -2147483648 > /proc/sys/net/ipv4/tcp_notsent_lowat
> cat /proc/sys/net/ipv4/tcp_notsent_lowat
-2147483648

but in documentation we have "tcp_notsent_lowat - UNSIGNED INTEGER"

Signed-off-by: Pavel Tikhomirov 
---
 net/ipv4/sysctl_net_ipv4.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 80bc36b..5361373 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static unsigned int uint_max = UINT_MAX;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -958,7 +959,9 @@ static struct ctl_table ipv4_net_table[] = {
.data   = &init_net.ipv4.sysctl_tcp_notsent_lowat,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_doulongvec_minmax,
+   .extra1 = &zero,
+   .extra2 = &uint_max,
},
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
{
-- 
2.9.3



[PATCH v2] netfilter: nf_log: fix error on write NONE to logger choice sysctl

2016-07-01 Thread Pavel Tikhomirov
It is hard to unbind nf-logger:

  echo NONE > /proc/sys/net/netfilter/nf_log/0
  bash: echo: write error: No such file or directory

  sysctl -w net.netfilter.nf_log.0=NONE
  sysctl: setting key "net.netfilter.nf_log.0": No such file or directory
  net.netfilter.nf_log.0 = NONE

You need explicitly send '\0', for instance like:

  echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0

That seem to be strange, so fix it using proc_dostring.

Now it works fine:
   modprobe nfnetlink_log
   echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0
   cat /proc/sys/net/netfilter/nf_log/0
   nfnetlink_log
   echo NONE > /proc/sys/net/netfilter/nf_log/0
   cat /proc/sys/net/netfilter/nf_log/0
   NONE

v2: add missed error check for proc_dostring

Signed-off-by: Pavel Tikhomirov 
---
 net/netfilter/nf_log.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index a5d41df..58ea5b0 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -398,16 +398,17 @@ static int nf_log_proc_dostring(struct ctl_table *table, 
int write,
 {
const struct nf_logger *logger;
char buf[NFLOGGER_NAME_LEN];
-   size_t size = *lenp;
int r = 0;
int tindex = (unsigned long)table->extra1;
struct net *net = current->nsproxy->net_ns;
 
if (write) {
-   if (size > sizeof(buf))
-   size = sizeof(buf);
-   if (copy_from_user(buf, buffer, size))
-   return -EFAULT;
+   struct ctl_table tmp = *table;
+
+   tmp.data = buf;
+   r = proc_dostring(&tmp, write, buffer, lenp, ppos);
+   if (r)
+   return r;
 
if (!strcmp(buf, "NONE")) {
nf_log_unbind_pf(net, tindex);
-- 
2.5.5



[PATCH] netfilter: nf_log: fix error on write NONE to logger choice sysctl

2016-07-01 Thread Pavel Tikhomirov
It is hard to unbind nf-logger:

  echo NONE > /proc/sys/net/netfilter/nf_log/0
  bash: echo: write error: No such file or directory

  sysctl -w net.netfilter.nf_log.0=NONE
  sysctl: setting key "net.netfilter.nf_log.0": No such file or directory
  net.netfilter.nf_log.0 = NONE

You need explicitly send '\0', for instance like:

  echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0

That seem to be strange, so fix it using proc_dostring.

Now it works fine:
   modprobe nfnetlink_log
   echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0
   cat /proc/sys/net/netfilter/nf_log/0
   nfnetlink_log
   echo NONE > /proc/sys/net/netfilter/nf_log/0
   cat /proc/sys/net/netfilter/nf_log/0
   NONE

Signed-off-by: Pavel Tikhomirov 
---
 net/netfilter/nf_log.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index a5d41df..c7c5882 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -398,16 +398,15 @@ static int nf_log_proc_dostring(struct ctl_table *table, 
int write,
 {
const struct nf_logger *logger;
char buf[NFLOGGER_NAME_LEN];
-   size_t size = *lenp;
int r = 0;
int tindex = (unsigned long)table->extra1;
struct net *net = current->nsproxy->net_ns;
 
if (write) {
-   if (size > sizeof(buf))
-   size = sizeof(buf);
-   if (copy_from_user(buf, buffer, size))
-   return -EFAULT;
+   struct ctl_table tmp = *table;
+
+   tmp.data = buf;
+   proc_dostring(&tmp, write, buffer, lenp, ppos);
 
if (!strcmp(buf, "NONE")) {
nf_log_unbind_pf(net, tindex);
-- 
2.5.5



Re: BUG: KASAN: slab-out-of-bounds in ses_enclosure_data_process+0x900/0xe50

2015-12-03 Thread Pavel Tikhomirov
I have very similar problem with SAS2X28, please take a look on a bug 
report here https://bugzilla.kernel.org/show_bug.cgi?id=108771


Thanks, Pavel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:timers/urgent] posix-timers: Show clock ID in proc file

2013-05-28 Thread tip-bot for Pavel Tikhomirov
Commit-ID:  15ef0298deb3929eb6ad6d2334fd2059fd53807c
Gitweb: http://git.kernel.org/tip/15ef0298deb3929eb6ad6d2334fd2059fd53807c
Author: Pavel Tikhomirov 
AuthorDate: Fri, 17 May 2013 02:12:03 +0400
Committer:  Thomas Gleixner 
CommitDate: Tue, 28 May 2013 11:41:14 +0200

posix-timers: Show clock ID in proc file

Expand information about posix-timers in /proc//timers by adding
info about clock, with which the timer was created. I.e. in the forth
line of timer info after "notify:" line go "ClockID: ".

Signed-off-by: Pavel Tikhomirov 
Cc: Michael Kerrisk 
Cc: Matthew Helsley 
Cc: Pavel Emelyanov 
Link: 
http://lkml.kernel.org/r/1368742323-46949-2-git-send-email-snor...@gmail.com
Signed-off-by: Thomas Gleixner 
---
 fs/proc/base.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dd51e50..c3834da 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2118,6 +2118,7 @@ static int show_timer(struct seq_file *m, void *v)
nstr[notify & ~SIGEV_THREAD_ID],
(notify & SIGEV_THREAD_ID) ? "tid" : "pid",
pid_nr_ns(timer->it_pid, tp->ns));
+   seq_printf(m, "ClockID: %d\n", timer->it_clock);
 
return 0;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] posix-timers: Show clock ID in proc file

2013-05-17 Thread Pavel Tikhomirov
Expand information about posix-timers in /proc//timers by adding
info about clock, with which the timer was created. I.e. in the forth
line of timer info after "notify:" line go "ClockID: ".

Signed-off-by: Pavel Tikhomirov 
---
 fs/proc/base.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2dad4a9..8a38eef 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2079,6 +2079,7 @@ static int show_timer(struct seq_file *m, void *v)
nstr[notify & ~SIGEV_THREAD_ID],
(notify & SIGEV_THREAD_ID) ? "tid" : "pid",
pid_nr_ns(timer->it_pid, tp->ns));
+   seq_printf(m, "ClockID: %d\n", timer->it_clock);
 
return 0;
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/1] posix timers: Expand exitsting info in proc file

2013-05-17 Thread Pavel Tikhomirov
Hi.

I'm working on the checkpoint-restore project (http://criu.org), on
realisation of posix-timers. To compleatly checkpoint and restore these
timers we need to know which clock they are using. So I d'like to add
this information to existing syscall which shows posix-timers info.

Pavel Tikhomirov (1):
  posix-timers: Show clock ID in proc file

 fs/proc/base.c |1 +
 1 file changed, 1 insertion(+)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/