Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation
14.09.2017 13:45, Ian Kent пишет: > On 14/09/17 19:39, Stanislav Kinsburskiy wrote: >> >> >> 14.09.2017 13:29, Ian Kent пишет: >>> On 14/09/17 17:24, Stanislav Kinsburskiy wrote: >>>> >>>> >>>> 14.09.2017 02:38, Ian Kent пишет: >>>>> On 01/09/17 19:21, Stanislav Kinsburskiy wrote: >>>>>> Signed-off-by: Stanislav Kinsburskiy >>>>>> --- >>>>>> fs/autofs4/autofs_i.h |3 +++ >>>>>> fs/autofs4/dev-ioctl.c |3 +++ >>>>>> fs/autofs4/inode.c |4 +++- >>>>>> 3 files changed, 9 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h >>>>>> index 4737615..3da105f 100644 >>>>>> --- a/fs/autofs4/autofs_i.h >>>>>> +++ b/fs/autofs4/autofs_i.h >>>>>> @@ -120,6 +120,9 @@ struct autofs_sb_info { >>>>>> struct list_head active_list; >>>>>> struct list_head expiring_list; >>>>>> struct rcu_head rcu; >>>>>> +#ifdef CONFIG_COMPAT >>>>>> +unsigned is32bit:1; >>>>>> +#endif >>>>>> }; >>>>>> >>>>>> static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb) >>>>>> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c >>>>>> index b7c816f..467d6c4 100644 >>>>>> --- a/fs/autofs4/dev-ioctl.c >>>>>> +++ b/fs/autofs4/dev-ioctl.c >>>>>> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file >>>>>> *fp, >>>>>> sbi->pipefd = pipefd; >>>>>> sbi->pipe = pipe; >>>>>> sbi->catatonic = 0; >>>>>> +#ifdef CONFIG_COMPAT >>>>>> +sbi->is32bit = is_compat_task(); >>>>>> +#endif >>>>>> } >>>>>> out: >>>>>> put_pid(new_pid); >>>>>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c >>>>>> index 09e7d68..21d3c0b 100644 >>>>>> --- a/fs/autofs4/inode.c >>>>>> +++ b/fs/autofs4/inode.c >>>>>> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void >>>>>> *data, int silent) >>>>>> } else { >>>>>> sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID); >>>>>> } >>>>>> - >>>>>> +#ifdef CONFIG_COMPAT >>>>>> +sbi->is32bit = is_compat_task(); >>>>>> +#endif >>>>>> if (autofs_type_trigger(sbi->type)) >>>>>> __managed_dentry_set_managed(root); >>>>>> >>>>>> >>>>> >>>>> Not sure about this. >>>>> >>>>> Don't you think it would be better to avoid the in code #ifdefs by doing >>>>> some >>>>> checks and defines in the header file and defining what's need to just use >>>>> is_compat_task(). >>>>> >>>> >>>> Yes, might be... >>>> >>>>> Not sure 2 patches are needed for this either .. >>>>> >>>> >>>> Well, I found this issue occasionally. >>> >>> I'm wondering what the symptoms are? >>> >> >> Size of struct autofs_v5_packet is 300 bytes for x86 and 304 bytes for >> x86_64. >> Which means, that 32bit task can read more than size of autofs_v5_packet on >> 64bit kernel. > > Are you sure? > > Shouldn't that be a short read on the x86 side of a 4 bytes longer > structure on the x86_64 side. > > I didn't think you could have a 64 bit client on a 32 bit kernel > so the converse (the read past end of struct) doesn't apply. > Sorry for the confusion, I had to add brackets like this: > Which means, that 32bit task can read more than size of autofs_v5_packet (on > 64bit kernel). IOW, 32bit task expects to read 300 bytes (size of struct autofs_v5_packet) while there are 304 bytes available on the "wire" from the 64bit kernel. > Ian >
Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation
14.09.2017 13:29, Ian Kent пишет: > On 14/09/17 17:24, Stanislav Kinsburskiy wrote: >> >> >> 14.09.2017 02:38, Ian Kent пишет: >>> On 01/09/17 19:21, Stanislav Kinsburskiy wrote: >>>> Signed-off-by: Stanislav Kinsburskiy >>>> --- >>>> fs/autofs4/autofs_i.h |3 +++ >>>> fs/autofs4/dev-ioctl.c |3 +++ >>>> fs/autofs4/inode.c |4 +++- >>>> 3 files changed, 9 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h >>>> index 4737615..3da105f 100644 >>>> --- a/fs/autofs4/autofs_i.h >>>> +++ b/fs/autofs4/autofs_i.h >>>> @@ -120,6 +120,9 @@ struct autofs_sb_info { >>>>struct list_head active_list; >>>>struct list_head expiring_list; >>>>struct rcu_head rcu; >>>> +#ifdef CONFIG_COMPAT >>>> + unsigned is32bit:1; >>>> +#endif >>>> }; >>>> >>>> static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb) >>>> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c >>>> index b7c816f..467d6c4 100644 >>>> --- a/fs/autofs4/dev-ioctl.c >>>> +++ b/fs/autofs4/dev-ioctl.c >>>> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp, >>>>sbi->pipefd = pipefd; >>>>sbi->pipe = pipe; >>>>sbi->catatonic = 0; >>>> +#ifdef CONFIG_COMPAT >>>> + sbi->is32bit = is_compat_task(); >>>> +#endif >>>>} >>>> out: >>>>put_pid(new_pid); >>>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c >>>> index 09e7d68..21d3c0b 100644 >>>> --- a/fs/autofs4/inode.c >>>> +++ b/fs/autofs4/inode.c >>>> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void >>>> *data, int silent) >>>>} else { >>>>sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID); >>>>} >>>> - >>>> +#ifdef CONFIG_COMPAT >>>> + sbi->is32bit = is_compat_task(); >>>> +#endif >>>>if (autofs_type_trigger(sbi->type)) >>>>__managed_dentry_set_managed(root); >>>> >>>> >>> >>> Not sure about this. >>> >>> Don't you think it would be better to avoid the in code #ifdefs by doing >>> some >>> checks and defines in the header file and defining what's need to just use >>> is_compat_task(). >>> >> >> Yes, might be... >> >>> Not sure 2 patches are needed for this either .. >>> >> >> Well, I found this issue occasionally. > > I'm wondering what the symptoms are? > Size of struct autofs_v5_packet is 300 bytes for x86 and 304 bytes for x86_64. Which means, that 32bit task can read more than size of autofs_v5_packet on 64bit kernel. >> And, frankly speaking, it's not clear to me, whether this issue is important >> at all, so I wanted to clarify this first. >> Thanks to O_DIRECT, the only way to catch the issue is to try to read more, >> than expected, in compat task (that's how I found it). > > Right, the O_DIRECT patch from Linus was expected to fix the structure > alignment problem. The stuct field offsets are ok aren't they? > Yes, they are ok. >> I don't see any other flaw so far. And if so, that, probably, we shouldn't >> care about the issue at all. >> What do you think? > > If we are seeing hangs, incorrect struct fields or similar something > should be done about it but if all is actually working ok then the > O_DIRECT fix is doing it's job and further changes aren't necessary. > Well, yes. O_DIRECT fix covers the issue. Ok then. Thanks for the clarification! > Ian >
Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation
14.09.2017 02:38, Ian Kent пишет: > On 01/09/17 19:21, Stanislav Kinsburskiy wrote: >> Signed-off-by: Stanislav Kinsburskiy >> --- >> fs/autofs4/autofs_i.h |3 +++ >> fs/autofs4/dev-ioctl.c |3 +++ >> fs/autofs4/inode.c |4 +++- >> 3 files changed, 9 insertions(+), 1 deletion(-) >> >> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h >> index 4737615..3da105f 100644 >> --- a/fs/autofs4/autofs_i.h >> +++ b/fs/autofs4/autofs_i.h >> @@ -120,6 +120,9 @@ struct autofs_sb_info { >> struct list_head active_list; >> struct list_head expiring_list; >> struct rcu_head rcu; >> +#ifdef CONFIG_COMPAT >> +unsigned is32bit:1; >> +#endif >> }; >> >> static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb) >> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c >> index b7c816f..467d6c4 100644 >> --- a/fs/autofs4/dev-ioctl.c >> +++ b/fs/autofs4/dev-ioctl.c >> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp, >> sbi->pipefd = pipefd; >> sbi->pipe = pipe; >> sbi->catatonic = 0; >> +#ifdef CONFIG_COMPAT >> +sbi->is32bit = is_compat_task(); >> +#endif >> } >> out: >> put_pid(new_pid); >> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c >> index 09e7d68..21d3c0b 100644 >> --- a/fs/autofs4/inode.c >> +++ b/fs/autofs4/inode.c >> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void >> *data, int silent) >> } else { >> sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID); >> } >> - >> +#ifdef CONFIG_COMPAT >> +sbi->is32bit = is_compat_task(); >> +#endif >> if (autofs_type_trigger(sbi->type)) >> __managed_dentry_set_managed(root); >> >> > > Not sure about this. > > Don't you think it would be better to avoid the in code #ifdefs by doing some > checks and defines in the header file and defining what's need to just use > is_compat_task(). > Yes, might be... > Not sure 2 patches are needed for this either .. > Well, I found this issue occasionally. And, frankly speaking, it's not clear to me, whether this issue is important at all, so I wanted to clarify this first. Thanks to O_DIRECT, the only way to catch the issue is to try to read more, than expected, in compat task (that's how I found it). I don't see any other flaw so far. And if so, that, probably, we shouldn't care about the issue at all. What do you think? > Ian >
[RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation
Signed-off-by: Stanislav Kinsburskiy --- fs/autofs4/autofs_i.h |3 +++ fs/autofs4/dev-ioctl.c |3 +++ fs/autofs4/inode.c |4 +++- 3 files changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h index 4737615..3da105f 100644 --- a/fs/autofs4/autofs_i.h +++ b/fs/autofs4/autofs_i.h @@ -120,6 +120,9 @@ struct autofs_sb_info { struct list_head active_list; struct list_head expiring_list; struct rcu_head rcu; +#ifdef CONFIG_COMPAT + unsigned is32bit:1; +#endif }; static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb) diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c index b7c816f..467d6c4 100644 --- a/fs/autofs4/dev-ioctl.c +++ b/fs/autofs4/dev-ioctl.c @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp, sbi->pipefd = pipefd; sbi->pipe = pipe; sbi->catatonic = 0; +#ifdef CONFIG_COMPAT + sbi->is32bit = is_compat_task(); +#endif } out: put_pid(new_pid); diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c index 09e7d68..21d3c0b 100644 --- a/fs/autofs4/inode.c +++ b/fs/autofs4/inode.c @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent) } else { sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID); } - +#ifdef CONFIG_COMPAT + sbi->is32bit = is_compat_task(); +#endif if (autofs_type_trigger(sbi->type)) __managed_dentry_set_managed(root);
[RFC PATCH 2/2] autofs: sent 32-bit sized packet for 32-bit process
The structure autofs_v5_packet (except name) is not aligned by 8 bytes, which leads to different sizes in 32 and 64-bit architectures. Let's form 32-bit compatible packet when daemon has 32-bit addressation. Suggested-by: Dmitry V. Levin Signed-off-by: Stanislav Kinsburskiy --- fs/autofs4/waitq.c |5 + 1 file changed, 5 insertions(+) diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c index 24a58bf..1f9b7d8 100644 --- a/fs/autofs4/waitq.c +++ b/fs/autofs4/waitq.c @@ -151,6 +151,11 @@ static void autofs4_notify_daemon(struct autofs_sb_info *sbi, struct autofs_v5_packet *packet = &pkt.v5_pkt.v5_packet; struct user_namespace *user_ns = sbi->pipe->f_cred->user_ns; +#ifdef CONFIG_COMPAT + if (sbi->is32bit) + pktsz = offsetofend(struct autofs_v5_packet, name); + else +#endif pktsz = sizeof(*packet); packet->wait_queue_token = wq->wait_queue_token;
[RFC PATCH 0/2] autofs: fix autofs_v5_packet dlivery in compat mode
The problem is that in compat mode struct autofs_v5_packet has to have different size (i.e. 4 bytes less). This is RFC because: 1) This issue is hidden, because autofs pipe has O_DIRECT and the rest of the epacket is truncated when read. 2) X86 arch doesn't have is_compat_task() helper 3) It's now clear, what to do if "pgrp" option is specified. The following series implements... --- Stanislav Kinsburskiy (2): autofs: set compat flag on sbi when daemon uses 32bit addressation autofs: sent 32-bit sized packet for 32-bit process fs/autofs4/autofs_i.h |3 +++ fs/autofs4/dev-ioctl.c |3 +++ fs/autofs4/inode.c |4 +++- fs/autofs4/waitq.c |5 + 4 files changed, 14 insertions(+), 1 deletion(-)
Re: [PATCH v2] prctl: remove one-shot limitation for changing exe link
Gentlemen, ping. Let's decide something, how to get rid of this strange solution. It doesn't provide the security it was aimed to, looks ugly and obfuscates the user of the feature. It looks like it can be just thrown away. But if not, please, advice, what should be changed to make is safe and solid. 27.09.2016 17:39, Stanislav Kinsburskiy пишет: This limitation came with the reason to remove "another way for malicious code to obscure a compromised program and masquerade as a benign process" by allowing "security-concious program can use this prctl once during its early initialization to ensure the prctl cannot later be abused for this purpose": http://marc.info/?l=linux-kernel&m=133160684517468&w=2 This explanation doesn't look sufficient. The only thing "exe" link is indicating is the file, used to execve, which is basically nothing and not reliable immediately after process has returned from execve system call. Moreover, to use this feture, all the mappings to previous exe file have to be unmapped and all the new exe file permissions must be satisfied. Which means, that changing exe link is very similar to calling execve on the binary. The need to remove this limitations comes from migration of NFS mount point, which is not accessible during restore and replaced by other file system. Because of this exe link has to be changed twice. v2: Rebased on current linux-next de --- include/linux/sched.h |4 +++- kernel/sys.c | 10 -- 2 files changed, 3 insertions(+), 11 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index a1c9b42..ad48b7d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -520,7 +520,9 @@ static inline int get_dumpable(struct mm_struct *mm) /* leave room for more dump flags */ #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ #define MMF_VM_HUGEPAGE 17 /* set when VM_HUGEPAGE is set on vma */ -#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ +/* This ine-shot flag is droped due to necessivity of changing exe once again + * on NFS restore */ +//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ #define MMF_HAS_UPROBES 19 /* has uprobes */ #define MMF_RECALC_UPROBES20 /* MMF_HAS_UPROBES can be wrong */ diff --git a/kernel/sys.c b/kernel/sys.c index 89d5be4..fd6f508 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) fput(exe_file); } - /* -* The symlink can be changed only once, just to disallow arbitrary -* transitions malicious software might bring in. This means one -* could make a snapshot over all processes running and monitor -* /proc/pid/exe changes to notice unusual activity if needed. -*/ - err = -EPERM; - if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags)) - goto exit; - err = 0; /* set the new file, lockless */ get_file(exe.file);
[PATCH v2] prctl: remove one-shot limitation for changing exe link
This limitation came with the reason to remove "another way for malicious code to obscure a compromised program and masquerade as a benign process" by allowing "security-concious program can use this prctl once during its early initialization to ensure the prctl cannot later be abused for this purpose": http://marc.info/?l=linux-kernel&m=133160684517468&w=2 This explanation doesn't look sufficient. The only thing "exe" link is indicating is the file, used to execve, which is basically nothing and not reliable immediately after process has returned from execve system call. Moreover, to use this feture, all the mappings to previous exe file have to be unmapped and all the new exe file permissions must be satisfied. Which means, that changing exe link is very similar to calling execve on the binary. The need to remove this limitations comes from migration of NFS mount point, which is not accessible during restore and replaced by other file system. Because of this exe link has to be changed twice. v2: Rebased on current linux-next Signed-off-by: Stanislav Kinsburskiy Acked-by: Oleg Nesterov Acked-by: Cyrill Gorcunov --- include/linux/sched.h |4 +++- kernel/sys.c | 10 -- 2 files changed, 3 insertions(+), 11 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index a1c9b42..ad48b7d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -520,7 +520,9 @@ static inline int get_dumpable(struct mm_struct *mm) /* leave room for more dump flags */ #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ #define MMF_VM_HUGEPAGE17 /* set when VM_HUGEPAGE is set on vma */ -#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ +/* This ine-shot flag is droped due to necessivity of changing exe once again + * on NFS restore */ +//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ #define MMF_HAS_UPROBES19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ diff --git a/kernel/sys.c b/kernel/sys.c index 89d5be4..fd6f508 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) fput(exe_file); } - /* -* The symlink can be changed only once, just to disallow arbitrary -* transitions malicious software might bring in. This means one -* could make a snapshot over all processes running and monitor -* /proc/pid/exe changes to notice unusual activity if needed. -*/ - err = -EPERM; - if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags)) - goto exit; - err = 0; /* set the new file, lockless */ get_file(exe.file);
Re: [PATCH] prctl: remove one-shot limitation for changing exe link
30.07.2016 19:31, Eric W. Biederman пишет: Cyrill Gorcunov writes: On Mon, Jul 25, 2016 at 02:56:43PM -0500, Eric W. Biederman wrote: ... Also there is a big fat bug in prctl_set_mm_exe_file. It doesn't validate that the new file is a actually mmaped executable. We would definitely need that to be fixed before even considering removing the limit. Could you please elaborate? We check for inode being executable, what else needed? That the inode is mmaped into the process with executable mappings. Effectively what we check the old mapping for and refuse to remove the old mm_exe_file if it exists. I think a reasonable argument can be made that if the file is executable, and it is mmaped with executable pages that exe_file is not a complete lie. I might be missing something obvious, so sorry for the question -- when criu setups old exe link the inode we obtain from file open is not mapped into memory, the old exe not read by anyone because it's not even executed anyhow. So I don't really understand which mapping we should check here. Mind to point me? That sounds like an out and out bug that should not be preserved. Of course we should mmap the executable and set it up so that it can be executed (at least as much as the executable was previously mapped). Anything else is a buggy restart, and lying to userspace. Which is the important part. At the end of the day how much can userspace trust /proc/pid/exe? If we are too lax it is just a random file descriptor we can not trust at all. At which point there is exactly no point in preserving it in checkpoint/restart, because nothing will trust or look at it. You know, I think we should not trust exe link much, and in real we never could: this link is rather a hint pointing which executable a process has been using on execve call, once the process start working one can't be sure if the code currently running is exactly from the file pointed by exe link. It just a hint suitable for debuggin and obtain clean view of which processes are running on noncompromised system. Monitoring exe link change won't help much if there are malicious software running on the system. But it is not just a hint. It is a record of which executable we called execve on. Knowing which file was executed doesn't guarantee what is running now but it provides a very strong hint. At then end of a restart the state of a process should be (by definition) exactly the state the process was before a checkpoint and thus a state the original executable could have gotten into. I admit it is possible for an application to unmap itself. I honestly have not met that application (except perhaps criu). If the only user is checkpoint/restart perhaps it should be only ptrace that can set this and not the process itself with a prctl. I don't know. All I know is that we should work on making it a very trustable value even though in some specific instances we can set it. Since as I said I suppose nobody except us using this feature, we can setup some sysctl trigger for it (I personally think this is an overkill, but OTOH if people rely on the exe link and not going to use criu at all, this trigger will help). Some clarity of thought came to me, and I apologise for not replying sooner with it sooner. My problem with the original patch submission is that it was justifying changing prctl_set_mm_exe_file based on what prctl_set_mm_exe_file does today. As prctl_set_mm_exe_file was added for the checkpoint/restart case that is justifying changing code based on a buggy implementation. It is necessary to look at the ordinary situation. Without prctl_set_mm_exe /proc/[pid]/exe can be counted on as a record of which executable was last passed to execve. Furthermore the state of a process can be counted on to be a state reachable from calling execve on /proc/[pid]/exe. Which means to preserve those expectations prctl_set_mm_exe_file should in practice just be a nicer less cumbersome interface to things you can already achieve with execve. Justifying removale of the one-short nature for prctl_set_mm_exe_file is as straight forward as noting that a process can call execve on any executable file. However when I compare the invariants that execve has on a file (such as the executable being mmaped) I see some noticable disparities between what prctl_set_mm_exe_file allows and what execve allows. With prctl_set_mm_exe being less strict. So what I am requesting is very simple. That the checks in prctl_set_mm_exe_file be tightened up to more closely approach what execve requires. Thus preserving the value of the /proc/[pid]/exe for the applications that want to use the exe link. Eric, could you elaborate on the checks, you mentioned? I don't see any significant checks, which are applicable to prctl_set_mm_exe_file. Say, there is a check for PF_NPROC_EXCEEDED in execve, but this is not applicable. Some of the checks are related to file open, but this is not done in prctl_set_mm_exe_file.
Re: [RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine
04.08.2016 15:16, Jeff Layton пишет: On Thu, 2016-08-04 at 12:55 +0200, Stanislav Kinsburskiy wrote: 03.08.2016 19:36, Jeff Layton пишет: On Wed, 2016-08-03 at 20:54 +0400, Stanislav Kinsburskiy wrote: Otherwise freezer cgroup state might never become "FROZEN". Here is a deadlock scheme for 2 processes in one freezer cgroup, which is freezing: CPU 0 CPU 1 do_last inode_lock(dir->d_inode) vfs_create nfs_create ... __rpc_execute rpc_wait_bit_killable __refrigerator do_last inode_lock(dir->d_inode) So, the problem is that one process takes directory inode mutex, executes creation request and goes to refrigerator. Another one waits till directory lock is released, remains "thawed" and thus freezer cgroup state never becomes "FROZEN". Notes: 1) Interesting, that this is not a pure deadlock: one can thaw cgroup and then freeze it again. 2) The issue was introduced by commit d310310cbff18ec385c6ab4d58f33b100192a96a. 3) This patch is not aimed to fix the issue, but to show the problem root. Look like this problem moght be applicable to other hunks from the commit, mentioned above. Signed-off-by: Stanislav Kinsburskiy --- net/sunrpc/sched.c |1 - 1 file changed, 1 deletion(-) diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c index 9ae5885..ec7ccc1 100644 --- a/net/sunrpc/sched.c +++ b/net/sunrpc/sched.c @@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue); static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode) { - freezable_schedule_unsafe(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; Ummm...so what actually does the schedule() with this patch? Schedule() replaces freezable_schedule_unsafe() of course, sorry for this. There was a bit of discussion on this recently -- see the thread with this subject line in linux-nfs: Re: Hang due to nfs letting tasks freeze with locked inodes Thanks, had a look. Basically it comes down to this: All of the proposals so far to fix this problem just switch out the freezable_schedule_unsafe (and similar) calls for those that don't allow the process to freeze. The problem there is that we originally added that stuff in response to bug reports about machines failing to suspend. What often happens is that the network interfaces come down, and then the freezer runs over all of the processes, which never return because they're blocked waiting on the server to reply. I probably don't understand something, but this sounds somewhat wrong to me: freezing processes _after_ network is down. ...shrug... Maybe we should just go ahead and do it (and to CIFS as well). Just be prepared for the inevitable complaints about laptops failing to suspend once you do. The worst part in all of this, from my POW, is that current behavior makes NFS non-freezable in a generic case, even in case of freezing a container, which has it's own net ns and NFS mount. So, I would say, that returning of previous logic would make the world better. Part of the fix, I think is to add a return code (similar to ERESTARTSYS) that gets interpreted near the kernel-userland boundary as: "allow the process to be frozen, and then retry the call once it's resumed". With that, filesystems could return the error code when they want to redrive the entire syscall from that level. That won't work for non- idempotent requests though. We'd need to do something more elaborate there. Might be, that breaking rpc request is something that should be avoided at all. With all these locks being held, almost all (any?) of the requests to remote server should be considered as an atomic operation from freezer point of view. The process always can be frozen on signal handling. IOW, I might worth considering a scenario, when NFS is not freezable at all, and any problems with suspend on laptops/whatever have to solved in suspend code. Fair enough. At this point, I don't care much one way or another. Maybe if we make this change and laptops start failing to suspend, we'll be able to use that as leverage pursue other avenues to make the suspend/resume subsystem work with NFS. That said, the patch you have really isn't sufficient. There are places where the NFS client can sleep while waiting for things other than RPC calls. Sure. As I said, this patch wasn't aimed to fix the issue but rather start the discussion. Thanks for your patch.
Re: [RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine
03.08.2016 19:36, Jeff Layton пишет: On Wed, 2016-08-03 at 20:54 +0400, Stanislav Kinsburskiy wrote: Otherwise freezer cgroup state might never become "FROZEN". Here is a deadlock scheme for 2 processes in one freezer cgroup, which is freezing: CPU 0 CPU 1 do_last inode_lock(dir->d_inode) vfs_create nfs_create ... __rpc_execute rpc_wait_bit_killable __refrigerator do_last inode_lock(dir->d_inode) So, the problem is that one process takes directory inode mutex, executes creation request and goes to refrigerator. Another one waits till directory lock is released, remains "thawed" and thus freezer cgroup state never becomes "FROZEN". Notes: 1) Interesting, that this is not a pure deadlock: one can thaw cgroup and then freeze it again. 2) The issue was introduced by commit d310310cbff18ec385c6ab4d58f33b100192a96a. 3) This patch is not aimed to fix the issue, but to show the problem root. Look like this problem moght be applicable to other hunks from the commit, mentioned above. Signed-off-by: Stanislav Kinsburskiy --- net/sunrpc/sched.c |1 - 1 file changed, 1 deletion(-) diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c index 9ae5885..ec7ccc1 100644 --- a/net/sunrpc/sched.c +++ b/net/sunrpc/sched.c @@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue); static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode) { - freezable_schedule_unsafe(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; Ummm...so what actually does the schedule() with this patch? Schedule() replaces freezable_schedule_unsafe() of course, sorry for this. There was a bit of discussion on this recently -- see the thread with this subject line in linux-nfs: Re: Hang due to nfs letting tasks freeze with locked inodes Thanks, had a look. Basically it comes down to this: All of the proposals so far to fix this problem just switch out the freezable_schedule_unsafe (and similar) calls for those that don't allow the process to freeze. The problem there is that we originally added that stuff in response to bug reports about machines failing to suspend. What often happens is that the network interfaces come down, and then the freezer runs over all of the processes, which never return because they're blocked waiting on the server to reply. I probably don't understand something, but this sounds somewhat wrong to me: freezing processes _after_ network is down. ...shrug... Maybe we should just go ahead and do it (and to CIFS as well). Just be prepared for the inevitable complaints about laptops failing to suspend once you do. The worst part in all of this, from my POW, is that current behavior makes NFS non-freezable in a generic case, even in case of freezing a container, which has it's own net ns and NFS mount. So, I would say, that returning of previous logic would make the world better. Part of the fix, I think is to add a return code (similar to ERESTARTSYS) that gets interpreted near the kernel-userland boundary as: "allow the process to be frozen, and then retry the call once it's resumed". With that, filesystems could return the error code when they want to redrive the entire syscall from that level. That won't work for non- idempotent requests though. We'd need to do something more elaborate there. Might be, that breaking rpc request is something that should be avoided at all. With all these locks being held, almost all (any?) of the requests to remote server should be considered as an atomic operation from freezer point of view. The process always can be frozen on signal handling. IOW, I might worth considering a scenario, when NFS is not freezable at all, and any problems with suspend on laptops/whatever have to solved in suspend code.
[RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine
Otherwise freezer cgroup state might never become "FROZEN". Here is a deadlock scheme for 2 processes in one freezer cgroup, which is freezing: CPU 0 CPU 1 do_last inode_lock(dir->d_inode) vfs_create nfs_create ... __rpc_execute rpc_wait_bit_killable __refrigerator do_last inode_lock(dir->d_inode) So, the problem is that one process takes directory inode mutex, executes creation request and goes to refrigerator. Another one waits till directory lock is released, remains "thawed" and thus freezer cgroup state never becomes "FROZEN". Notes: 1) Interesting, that this is not a pure deadlock: one can thaw cgroup and then freeze it again. 2) The issue was introduced by commit d310310cbff18ec385c6ab4d58f33b100192a96a. 3) This patch is not aimed to fix the issue, but to show the problem root. Look like this problem moght be applicable to other hunks from the commit, mentioned above. Signed-off-by: Stanislav Kinsburskiy --- net/sunrpc/sched.c |1 - 1 file changed, 1 deletion(-) diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c index 9ae5885..ec7ccc1 100644 --- a/net/sunrpc/sched.c +++ b/net/sunrpc/sched.c @@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue); static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode) { - freezable_schedule_unsafe(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0;
Re: [PATCH] prctl: remove one-shot limitation for changing exe link
25.07.2016 20:21, Eric W. Biederman пишет: Stanislav Kinsburskiy writes: Gentlemen, Looks like there are no objections to this patch. There has been objection. The only justification for the change that has been put forward is someone doing a restore lazily. I don't see a reason why you can't call prctl_set_mm_exe_file until you have the file in place instead of a place holder that sounds like a trivial solution to any restore issues. If I understand your proposal correctly, you mean, that the call to prctl_set_mm_exe_file can be posponed till the actual file is in place. It can be done this way you propose (although it's ugly). But you objection looks like you want to preserve some behavior you believe is reliable. But it's not. The truth is an unlimited settable exe link is essentially meaningless, as you can't depend on it for anything. One shot seems the best compromise I have seen put forward between the definite checkpoint/restart requirement to set the this value and the general need to have something that makes sense and people can depend on for system management. Depending on exe link for system management is useful, but can't be reliable and can't prevent malicious software to compromise the system. If we wouldn't have the ability to change exe link, than the only thing we could be sure, that process at least started with the binary we believe is reliable. But since we can change it, the only thing we can rely now is the file is executable and process have permissions to execute it. And this guarantee in preserved across any amount of exe link replacement. Also there is a big fat bug in prctl_set_mm_exe_file. It doesn't validate that the new file is a actually mmaped executable. We would definitely need that to be fixed before even considering removing the limit. Why do we need it? To guarantee, that process has read permission to the file? Right now all I see is people involved in the implementation details of their own little feature So for the patch I am responding to: Nacked-by: "Eric W. Biederman" Plus the merge window is open so no one is taking any patches right now. It is the time to take what has already been staged and get that code merged. Eric
Re: [PATCH] prctl: remove one-shot limitation for changing exe link
18.07.2016 22:11, One Thousand Gnomes пишет: 1) Attach to process via ptrace (protected by CAP_SYS_PTRACE) 2) Unmap all the process file mappings, related to "exe" file. 3) Change exe link (protected by CAP_SYS_RESOURCE). IOW, some other process already has an access to process internals (and thus it's already compromised), and can inject fork and use the child of the compromised program to masquerade. Which means this limitation doesn't solve the problem it was aimed to. IFF it is the same uid or root (in which case you already lost). In the case of cross uid activity this is not true. Could you elaborate on it, please? Alan
Re: [PATCH] prctl: remove one-shot limitation for changing exe link
12.07.2016 18:52, Eric W. Biederman пишет: Cyrill Gorcunov writes: On Tue, Jul 12, 2016 at 07:30:29PM +0400, Stanislav Kinsburskiy wrote: This limitation came with the reason to remove "another way for malicious code to obscure a compromised program and masquerade as a benign process" by allowing "security-concious program can use this prctl once during its early initialization to ensure the prctl cannot later be abused for this purpose": http://marc.info/?l=linux-kernel&m=133160684517468&w=2 But the way how the feature can be used is the following: 1) Attach to process via ptrace (protected by CAP_SYS_PTRACE) 2) Unmap all the process file mappings, related to "exe" file. 3) Change exe link (protected by CAP_SYS_RESOURCE). IOW, some other process already has an access to process internals (and thus it's already compromised), and can inject fork and use the child of the compromised program to masquerade. Which means this limitation doesn't solve the problem it was aimed to. While removing this limitation allow to replace files from underneath of a running process as many times as required. One of the use cases is network file systems migration (NFS, to be precise) by CRIU. NFS mount can't be mounted on restore stage because network is locked. To overcome this limitation, another file system (FUSE-based) is used. Then opened files replaced by the proper ones NFS is remounted. Thus exe link replace has to be done twice: first on restore stage and second - when actual NFS was remounted. Signed-off-by: Stanislav Kinsburskiy Persistent exe-link doesn't guarantee anything if you have rights to ptrace task and inject own code into (from security POV). So lets rip it out. Acked-by: Cyrill Gorcunov I believe the original concern was someone injecting a code into a process and playing silly buggers with the exe link. Someone who does not have ptrace capability. It is completely not ok to change this until someone goes back to the original conversation and looks at the original threat model, and refutes it. Fair enough, Eric. That's why all the people, who were participating in original discussion, included in the recipients list. Eric
Re: [PATCH] prctl: remove one-shot limitation for changing exe link
12.07.2016 18:42, Oleg Nesterov пишет: On 07/12, Stanislav Kinsburskiy wrote: --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) fput(exe_file); } - /* -* The symlink can be changed only once, just to disallow arbitrary -* transitions malicious software might bring in. This means one -* could make a snapshot over all processes running and monitor -* /proc/pid/exe changes to notice unusual activity if needed. -*/ - err = -EPERM; - if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags)) - goto exit; - I didn't even try to read the changelog so I do not know why do you want this change ;) But I would like to ack it in any case. I never understood why do we want/need this MMF_EXE_FILE_CHANGED check, I suggested to remove it many times. And can't resist, please note the xchg() below. Currently (before this patch) we do not need it. I was specially added to ensure that we can just remove this test_and_set_bit(MMF_EXE_FILE_CHANGED) without adding a race. Thanks, Oleg. I'll take a look. But should this be addressed in this patch? Especially if it's not needed even now (before this patch)? Acked-by: Oleg Nesterov
[PATCH] prctl: remove one-shot limitation for changing exe link
This limitation came with the reason to remove "another way for malicious code to obscure a compromised program and masquerade as a benign process" by allowing "security-concious program can use this prctl once during its early initialization to ensure the prctl cannot later be abused for this purpose": http://marc.info/?l=linux-kernel&m=133160684517468&w=2 But the way how the feature can be used is the following: 1) Attach to process via ptrace (protected by CAP_SYS_PTRACE) 2) Unmap all the process file mappings, related to "exe" file. 3) Change exe link (protected by CAP_SYS_RESOURCE). IOW, some other process already has an access to process internals (and thus it's already compromised), and can inject fork and use the child of the compromised program to masquerade. Which means this limitation doesn't solve the problem it was aimed to. While removing this limitation allow to replace files from underneath of a running process as many times as required. One of the use cases is network file systems migration (NFS, to be precise) by CRIU. NFS mount can't be mounted on restore stage because network is locked. To overcome this limitation, another file system (FUSE-based) is used. Then opened files replaced by the proper ones NFS is remounted. Thus exe link replace has to be done twice: first on restore stage and second - when actual NFS was remounted. Signed-off-by: Stanislav Kinsburskiy --- include/linux/sched.h |4 +++- kernel/sys.c | 10 -- 2 files changed, 3 insertions(+), 11 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 553af29..83b5f2d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -518,7 +518,9 @@ static inline int get_dumpable(struct mm_struct *mm) /* leave room for more dump flags */ #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ #define MMF_VM_HUGEPAGE17 /* set when VM_HUGEPAGE is set on vma */ -#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ +/* This ine-shot flag is droped due to necessivity of changing exe once again + * on NFS restore */ +//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ #define MMF_HAS_UPROBES19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ diff --git a/kernel/sys.c b/kernel/sys.c index 89d5be4..fd6f508 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) fput(exe_file); } - /* -* The symlink can be changed only once, just to disallow arbitrary -* transitions malicious software might bring in. This means one -* could make a snapshot over all processes running and monitor -* /proc/pid/exe changes to notice unusual activity if needed. -*/ - err = -EPERM; - if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags)) - goto exit; - err = 0; /* set the new file, lockless */ get_file(exe.file);
[PATCH] prctl: remove one-shot limitation for changing exe link
This limitation came with the reason to remove "another way for malicious code to obscure a compromised program and masquerade as a benign process" by allowing "security-concious program can use this prctl once during its early initialization to ensure the prctl cannot later be abused for this purpose": http://marc.info/?l=linux-kernel&m=133160684517468&w=2 But the way how the feature can be used is the following: 1) Attach to process via ptrace (protected by CAP_SYS_PTRACE) 2) Unmap all the process file mappings, related to "exe" file. 3) Change exe link (protected by CAP_SYS_RESOURCE). IOW, some other process already has an access to process internals (and thus it's already compromised), and can inject fork and use the child of the compromised program to masquerade. Which means this limitation doesn't solve the problem it was aimed to. While removing this limitation allow to replace files from underneath of a running process as many times as required. One of the use cases is network file systems migration (NFS, to be precise) by CRIU. NFS mount can't be mounted on restore stage because network is locked. To overcome this limitation, another file system (FUSE-based) is used. Then opened files replaced by the proper ones NFS is remounted. Thus exe link replace has to be done twice: first on restore stage and second - when actual NFS was remounted. Signed-off-by: Stanislav Kinsburskiy --- include/linux/sched.h |4 +++- kernel/sys.c | 10 -- 2 files changed, 3 insertions(+), 11 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 553af29..83b5f2d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -518,7 +518,9 @@ static inline int get_dumpable(struct mm_struct *mm) /* leave room for more dump flags */ #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ #define MMF_VM_HUGEPAGE17 /* set when VM_HUGEPAGE is set on vma */ -#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ +/* This ine-shot flag is droped due to necessivity of changing exe once again + * on NFS restore */ +//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ #define MMF_HAS_UPROBES19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ diff --git a/kernel/sys.c b/kernel/sys.c index 89d5be4..fd6f508 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) fput(exe_file); } - /* -* The symlink can be changed only once, just to disallow arbitrary -* transitions malicious software might bring in. This means one -* could make a snapshot over all processes running and monitor -* /proc/pid/exe changes to notice unusual activity if needed. -*/ - err = -EPERM; - if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags)) - goto exit; - err = 0; /* set the new file, lockless */ get_file(exe.file);
Re: call_usermodehelper in containers
13.02.2016 00:39, Ian Kent пишет: On Fri, 2013-11-15 at 15:54 +0400, Stanislav Kinsbursky wrote: 15.11.2013 15:03, Eric W. Biederman пишет: Stanislav Kinsbursky writes: 12.11.2013 17:30, Jeff Layton пишет: On Tue, 12 Nov 2013 17:02:36 +0400 Stanislav Kinsbursky wrote: 12.11.2013 15:12, Jeff Layton пишет: On Mon, 11 Nov 2013 16:47:03 -0800 Greg KH wrote: On Mon, Nov 11, 2013 at 07:18:25AM -0500, Jeff Layton wrote: We have a bit of a problem wrt to upcalls that use call_usermodehelper with containers and I'd like to bring this to some sort of resolution... A particularly problematic case (though there are others) is the nfsdcltrack upcall. It basically uses call_usermodehelper to run a program in userland to track some information on stable storage for nfsd. I thought the discussion at the kernel summit about this issue was: - don't do this. - don't do it. - if you really need to do this, fix nfsd Sorry, I couldn't make the kernel summit so I missed that discussion. I guess LWN didn't cover it? In any case, I guess then that we'll either have to come up with some way to fix nfsd here, or simply ensure that nfsd can never be started unless root in the container has a full set of a full set of capabilities. One sort of Rube Goldberg possibility to fix nfsd is: - when we start nfsd in a container, fork off an extra kernel thread that just sits idle. That thread would need to be a descendant of the userland process that started nfsd, so we'd need to create it with kernel_thread(). - Have the kernel just start up the UMH program in the init_ns mount namespace as it currently does, but also pass the pid of the idle kernel thread to the UMH upcall. - The program will then use /proc//root and /proc//ns/* to set itself up for doing things properly. Note that with this mechanism we can't actually run a different binary per container, but that's probably fine for most purposes. Hmmm... Why we can't? We can go a bit further with userspace idea. We use UMH some very limited number of user programs. For 2, actually: 1) /sbin/nfs_cache_getent 2) /sbin/nfsdcltrack No, the kernel uses them for a lot more than that. Pretty much all of the keys API upcalls use it. See all of the callers of call_usermodehelper. All of them are running user binaries out of the kernel, and almost all of them are certainly broken wrt containers. If we convert them into proxies, which use /proc//root and /proc//ns/*, this will allow us to lookup the right binary. The only limitation here is presence of this "proxy" binaries on "host". Suppose I spawn my own container as a user, using all of this spiffy new user namespace stuff. Then I make the kernel use call_usermodehelper to call the upcall in the init_ns, and then trick it into running my new "escape_from_namespace" program with "real" root privileges. I don't think we can reasonably assume that having the kernel exec an arbitrary binary inside of a container is safe. Doing so inside of the init_ns is marginally more safe, but only marginally so... And we don't need any significant changes in kernel. BTW, Jeff, could you remind me, please, why exactly we need to use UMH to run the binary? What are this capabilities, which force us to do so? Nothing _forces_ us to do so, but upcalls are very difficult to handle, and UMH has a lot of advantages over a long-running daemon launched by userland. Originally, I created the nfsdcltrack upcall as a running daemon called nfsdcld, and the kernel used rpc_pipefs to communicate with it. Everyone hated it because no one likes to have to run daemons for infrequently used upcalls. It's a pain for users to ensure that it's running and it's a pain to handle when it isn't. So, I was encouraged to turn that instead into a UMH upcall. But leaving that aside, this problem is a lot larger than just nfsd. We have a *lot* of UMH upcalls in the kernel, so this problem is more general than just "fixing" nfsd's. Ok. So we are talking about generic approach to UMH support in a container (and/or namespace). Actually, as far as I can see, there are more that one aspect, which is not supported. One one them is executing of the right binary. Another one is capabilities (and maybe there are more, like user namespaces), but I don't really care about them for now. Executing the right binary, actually, is not about namespaces at all. This is about lookup implementation in VFS (do_execve_common). Would be great to unshare FS for forked UHM kthread and swap it to desired root. This will solve the problem with proper lookup. However, as far as I understand, this approach is not welcome by the community. I don't understand that one. Having a preforked thread with the proper environment that can act like kthreadd in terms of spawning user mode helpers works and is simple. The only downside I can see is that there is extra overhead. What do you mean by "simple" here? Simp
[PATCH v2] autofs: show pipe inode in mount options
This is required for CRIU (Checkpoint Restart In Userspace) to migrate a mount point, when write end in user space is closed. Below is the brief description of the problem. To migrate non-catatonic autofs mount point, one have to restore control pipe between kernel and and autofs master process. One of the autofs masters is systemd, which closes pipe write end after passing it to the kernel with mount call. To be able to restore systemd control pipe, one have to know, which read pipe end in systemd corresponds to the write pipe end in the kernel. Pipe "fd" in mount options is not enough, because it was closed and probably replaced by some other descriptor. Thus, some other attribute is required to be able to find the read pipe end. The best attribute, allowing to find correct pipe end is inode number, becuase it's unique for the whole system and can't be reused until autofs mount exists. This attribute also allows to recognize a situation with autofs mount without master (no process with specified "pgrp" or not file descriptor with "pipe_ino", specified in autofs mount options). v2: 1) New option "pipe_ino" was moved to the end of the options list. 2) Option is printed only if CONFIG_CHECKPOINT_RESTORE is set. Signed-off-by: Stanislav Kinsburskiy --- fs/autofs4/inode.c |7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c index a3ae0b2..4320faa 100644 --- a/fs/autofs4/inode.c +++ b/fs/autofs4/inode.c @@ -94,7 +94,12 @@ static int autofs4_show_options(struct seq_file *m, struct dentry *root) seq_printf(m, ",direct"); else seq_printf(m, ",indirect"); - +#ifdef CONFIG_CHECKPOINT_RESTORE + if (sbi->pipe) + seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino); + else + seq_printf(m, ",pipe_ino=-1"); +#endif return 0; }
Re: [PATCH] autofs: show pipe inode in mount options
23.01.2016 01:30, Ian Kent пишет: On Fri, 2016-01-22 at 12:34 +0100, Stanislav Kinsburskiy wrote: Hi again, I would like to ask about any progress with this patch? Any other requirements to make it able to merge? Sorry for the delay. Since there haven't been any comments from Al or Stephen I'm think I should include it in the series I plan on sending to linux-next to rename autofs4 to autofs (among other things). I haven't had anything significant enough for autofs to warrant maintaining a tree and sending push requests so I'll need to ask Stephen what I need to do (perhaps you could offer some advise on that now Stephen, please). I'm also struggling to get back to this and carry out the needed testing and I'll need to re-base the series too now but I'm getting there. I didn't see a follow up patch with an updated description, did I miss it? No, you didn't. I have send it few minutes ago. Thanks. Ian
Re: [PATCH] autofs: show pipe inode in mount options
Hi again, I would like to ask about any progress with this patch? Any other requirements to make it able to merge?
Re: [PATCH] autofs: show pipe inode in mount options
Good day, gentlemen. Could you update, what's the status with this patch? Without it it's impossible to match process pipe with kernel pipe, while this is "must have" to be able to migrate AutoFS via CRIU. 16.12.2015 13:02, Stanislav Kinsburskiy пишет: This is required for CRIU to migrate a mount point, when write end in user space is closed. To be able to migrate such mount, read end of the pipe have to be searched within autofs master process, and pipe inode will be used as a key. Signed-off-by: Stanislav Kinsburskiy --- fs/autofs4/inode.c |4 1 file changed, 4 insertions(+) diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c index a3ae0b2..16f875a 100644 --- a/fs/autofs4/inode.c +++ b/fs/autofs4/inode.c @@ -77,6 +77,10 @@ static int autofs4_show_options(struct seq_file *m, struct dentry *root) return 0; seq_printf(m, ",fd=%d", sbi->pipefd); + if (sbi->pipe) + seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino); + else + seq_printf(m, ",pipe_ino=-1"); if (!uid_eq(root_inode->i_uid, GLOBAL_ROOT_UID)) seq_printf(m, ",uid=%u", from_kuid_munged(&init_user_ns, root_inode->i_uid)); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] autofs: show pipe inode in mount options
This is required for CRIU to migrate a mount point, when write end in user space is closed. To be able to migrate such mount, read end of the pipe have to be searched within autofs master process, and pipe inode will be used as a key. Signed-off-by: Stanislav Kinsburskiy --- fs/autofs4/inode.c |4 1 file changed, 4 insertions(+) diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c index a3ae0b2..16f875a 100644 --- a/fs/autofs4/inode.c +++ b/fs/autofs4/inode.c @@ -77,6 +77,10 @@ static int autofs4_show_options(struct seq_file *m, struct dentry *root) return 0; seq_printf(m, ",fd=%d", sbi->pipefd); + if (sbi->pipe) + seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino); + else + seq_printf(m, ",pipe_ino=-1"); if (!uid_eq(root_inode->i_uid, GLOBAL_ROOT_UID)) seq_printf(m, ",uid=%u", from_kuid_munged(&init_user_ns, root_inode->i_uid)); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 2/3] splice: new SPLICE_F_PACKET flag introduced
This flag is used by kernel only to represent pipe packetized mode for splice engine. Signed-off-by: Stanislav Kinsburskiy --- include/linux/splice.h |5 + 1 file changed, 5 insertions(+) diff --git a/include/linux/splice.h b/include/linux/splice.h index da2751d..13ca14d 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -20,6 +20,11 @@ #define SPLICE_F_MORE (0x04) /* expect more data */ #define SPLICE_F_GIFT (0x08) /* pages passed in are a gift */ +#ifdef __KERNEL__ + +#define SPLICE_F_PACKET(0x1000) /* pipe buffers have to be packetized */ + +#endif /* * Passed to the actors */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 0/3] Fix splice for packetized pipes
With introduction of packetized pipes, splice wasn't updated to respect this new behaviour. In terms of splice it means, that it never set PIPE_BUF_FLAG_PACKET on created pipe buffer regarless pipe operating mode, thus breaking the whole logic. To fix this, new SPLICE_F_PACKET flag was introduced. It's supposed to be used only in kernel and set, when write pipe is in packetized mode. In splice_to_pipe() it's converted into created pipe buffer PIPE_BUF_FLAG_PACKET flag. The following series implements... --- Stanislav Kinsburskiy (3): pipe: make is_packetized() non-static and declare in pipe_fs_i.h splice: new SPLICE_F_PACKET flag introduced splice: add support of splicing to packetized pipe fs/pipe.c |2 +- fs/splice.c |8 include/linux/pipe_fs_i.h |2 ++ include/linux/splice.h|5 + 4 files changed, 16 insertions(+), 1 deletion(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 1/3] pipe: make is_packetized() non-static and declare in pipe_fs_i.h
With introduction of packetized pipe mode, represented by O_DIRECT flag, splice stopped working correctly with a pipe in this mode. To be able to fix them, this helper have to exposed. Signed-off-by: Stanislav Kinsburskiy --- fs/pipe.c |2 +- include/linux/pipe_fs_i.h |2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/pipe.c b/fs/pipe.c index 42cf8dd..645d142 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -329,7 +329,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) return ret; } -static inline int is_packetized(struct file *file) +int is_packetized(struct file *file) { return (file->f_flags & O_DIRECT) != 0; } diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index eb8b8ac..9fdaf8f 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -146,4 +146,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file); int create_pipe_files(struct file **, int); +int is_packetized(struct file *file); + #endif -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 3/3] splice: add support of splicing to packetized pipe
This patch uses SPLICE_F_PACKET as a flag, representing packetized pipe. In splice_to_pipe() this flag is converted into PIPE_BUF_FLAG_PACKET on pipe buffer. Signed-off-by: Stanislav Kinsburskiy --- fs/splice.c |8 1 file changed, 8 insertions(+) diff --git a/fs/splice.c b/fs/splice.c index 4cf700d..d698fb2 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -210,6 +210,8 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, buf->ops = spd->ops; if (spd->flags & SPLICE_F_GIFT) buf->flags |= PIPE_BUF_FLAG_GIFT; + if (spd->flags & SPLICE_F_PACKET) + buf->flags |= PIPE_BUF_FLAG_PACKET; pipe->nrbufs++; page_nr++; @@ -1420,6 +1422,9 @@ static long do_splice(struct file *in, loff_t __user *off_in, offset = in->f_pos; } + if (is_packetized(out)) + flags |= SPLICE_F_PACKET; + ret = do_splice_to(in, &offset, opipe, len, flags); if (!off_in) @@ -1609,6 +1614,9 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov, if (splice_grow_spd(pipe, &spd)) return -ENOMEM; + if (is_packetized(file)) + spd.flags |= SPLICE_F_PACKET; + spd.nr_pages = get_iovec_page_array(iov, nr_segs, spd.pages, spd.partial, false, spd.nr_pages_max); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] fcntl: allow to set O_DIRECT flag on pipe
With packetized mode for pipes, it's not possible to set O_DIRECT on pipe file via sys_fcntl, because of unsupported (by pipes) sanity checks. Ability to set this flag will be used by CRIU to migrate packetized pipes. Signed-off-by: Stanislav Kinsburskiy --- fs/fcntl.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fcntl.c b/fs/fcntl.c index ee85cd4..4463a87 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -51,7 +51,8 @@ static int setfl(int fd, struct file * filp, unsigned long arg) if (arg & O_NDELAY) arg |= O_NONBLOCK; - if (arg & O_DIRECT) { + /* Pipe packetized mode is controlled by O_DIRECT flag */ + if (!IS_FIFO(f->f_mode) && (arg & O_DIRECT)) { if (!filp->f_mapping || !filp->f_mapping->a_ops || !filp->f_mapping->a_ops->direct_IO) return -EINVAL; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] fcntl: allow to set O_DIRECT flag on pipe
With packetized mode for pipes, it's not possible to set O_DIRECT on pipe file via sys_fcntl, because of unsupported sanity checks. Ability to set this flag will be used by CRIU to migrate packetized pipes. v2: Fixed typos and mode variable to check. Signed-off-by: Stanislav Kinsburskiy --- fs/fcntl.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fcntl.c b/fs/fcntl.c index ee85cd4..350a2c8 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -51,7 +51,8 @@ static int setfl(int fd, struct file * filp, unsigned long arg) if (arg & O_NDELAY) arg |= O_NONBLOCK; - if (arg & O_DIRECT) { + /* Pipe packetized mode is controlled by O_DIRECT flag */ + if (!S_ISFIFO(filp->f_inode->i_mode) && (arg & O_DIRECT)) { if (!filp->f_mapping || !filp->f_mapping->a_ops || !filp->f_mapping->a_ops->direct_IO) return -EINVAL; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/