Re: [RESEND PATCH V4] pidns: introduce syscall translate_pid

2018-04-04 Thread Konstantin Khlebnikov


On 04.04.2018 00:51, Nagarathnam Muthusamy wrote:
> On 04/03/2018 02:52 PM, Andrew Morton wrote:
> > On Tue, 3 Apr 2018 14:45:28 -0700 Nagarathnam Muthusamy wrote:
> >
> > > > This changelog doesn't explain what the value is to our users.  I
> > > > assume it is a performance optimization because "backward translation
> > > > requires scanning all tasks"?  If so, please show us real-world
> > > > examples of the performance benefit from this patch, and please go to
> > > > great lengths to explain to us why this optimisation is needed by our
> > > > users.
> > >
> > > One of the use cases for Oracle Database involves multiple levels of
> > > nested pid namespaces, and we require pid translation between the
> > > levels. The particular use case, and why none of the existing
> > > methods was usable, was discussed in the following thread:
> > >
> > > https://patchwork.kernel.org/patch/10276785/
> > >
> > > In the end, it was agreed that this patch, along with flocks, will
> > > solve the issue.
> >
> > Nobody who reads this patch's changelog will know any of this.  Please
> > let's get all this information into the proper place.
>
> Sure! Will resend the patch with updated change log.

I have a v5 of this proposal in the works.

I've redesigned the interface to be more convenient for cases where
strict race protection isn't required and the pid-ns can be referenced by pid.

It has 5 arguments rather than 3 because the types of references are
defined explicitly rather than by magic values like -1, >0, <0.
This is more verbose but protects against errors like passing -1 from
a failed previous syscall as an argument.

Something like:
translate_pid(pid, TRANSLATE_PID_FD_PIDNS, ns_fd, TRANSLATE_PID_CURRENT_PIDNS, 0)

I'll send it today, including a more detailed motivation for the patch.
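
The point about magic values can be illustrated with a small userspace model. This is only a sketch: the `model_*` names and the exact error codes are assumptions made for illustration and are not taken from the v5 patch, which this message only outlines. It shows how a `-1` left over from a failed `open()` is silently read as "current namespace" in the 3-argument style, while an explicit reference type lets the same value be rejected.

```c
#include <errno.h>

/* Hypothetical stand-ins for the reference types; not the real v5 constants. */
enum model_ref_type {
	MODEL_REF_CURRENT_PIDNS,	/* caller's own pid namespace, ref must be 0 */
	MODEL_REF_FD_PIDNS,		/* ref is a pid-namespace file descriptor */
};

/* v4 style: every negative value silently selects the current namespace. */
int model_v4_means_current(int ns)
{
	return ns < 0;
}

/* v5 style (assumed): the type says how to read ref, so bad values fail. */
int model_v5_check_ref(enum model_ref_type type, int ref)
{
	switch (type) {
	case MODEL_REF_CURRENT_PIDNS:
		return ref == 0 ? 0 : -EINVAL;
	case MODEL_REF_FD_PIDNS:
		return ref >= 0 ? 0 : -EBADF;
	}
	return -EINVAL;
}
```

With `ns_fd == -1`, `model_v4_means_current(ns_fd)` is true and the earlier failure goes unnoticed, whereas `model_v5_check_ref(MODEL_REF_FD_PIDNS, ns_fd)` reports an error.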

Re: [RESEND PATCH V4] pidns: introduce syscall translate_pid

2018-04-03 Thread Nagarathnam Muthusamy




On 04/03/2018 02:52 PM, Andrew Morton wrote:
> On Tue, 3 Apr 2018 14:45:28 -0700 Nagarathnam Muthusamy wrote:
>
> > > This changelog doesn't explain what the value is to our users.  I
> > > assume it is a performance optimization because "backward translation
> > > requires scanning all tasks"?  If so, please show us real-world
> > > examples of the performance benefit from this patch, and please go to
> > > great lengths to explain to us why this optimisation is needed by our
> > > users.
> >
> > One of the use cases for Oracle Database involves multiple levels of
> > nested pid namespaces, and we require pid translation between the
> > levels. The particular use case, and why none of the existing
> > methods was usable, was discussed in the following thread:
> >
> > https://patchwork.kernel.org/patch/10276785/
> >
> > In the end, it was agreed that this patch, along with flocks, will
> > solve the issue.
>
> Nobody who reads this patch's changelog will know any of this.  Please
> let's get all this information into the proper place.

Sure! Will resend the patch with updated change log.

Thanks,
Nagarathnam.

Re: [RESEND PATCH V4] pidns: introduce syscall translate_pid

2018-04-03 Thread Andrew Morton

On Tue, 3 Apr 2018 14:45:28 -0700 Nagarathnam Muthusamy wrote:

> > This changelog doesn't explain what the value is to our users.  I
> > assume it is a performance optimization because "backward translation
> > requires scanning all tasks"?  If so, please show us real-world
> > examples of the performance benefit from this patch, and please go to
> > great lengths to explain to us why this optimisation is needed by our
> > users.
> 
> One of the use cases for Oracle Database involves multiple levels of
> nested pid namespaces, and we require pid translation between the
> levels. The particular use case, and why none of the existing
> methods was usable, was discussed in the following thread:
> 
> https://patchwork.kernel.org/patch/10276785/
> 
> In the end, it was agreed that this patch, along with flocks, will
> solve the issue.

Nobody who reads this patch's changelog will know any of this.  Please
let's get all this information into the proper place.

Re: [RESEND PATCH V4] pidns: introduce syscall translate_pid

2018-04-03 Thread Nagarathnam Muthusamy



On 04/03/2018 02:38 PM, Andrew Morton wrote:
> On Mon,  2 Apr 2018 15:57:29 -0600 nagarathnam.muthus...@oracle.com wrote:
>
> > pid_t translate_pid(pid_t pid, int source, int target);
> >
> > This syscall converts a pid from the source pid-ns into a pid in the
> > target pid-ns. If the pid is unreachable from the target pid-ns, it
> > returns zero.
> >
> > Pid namespaces are referred to by file descriptors opened on the proc
> > files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative
> > argument refers to the current pid namespace, same as the file
> > /proc/self/ns/pid.
> >
> > The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but
> > backward translation requires scanning all tasks. Pids can also be
> > translated by sending them through a unix socket between namespaces,
> > but this method is slow and insecure because the other side is exposed
> > inside the pid namespace.
> >
> > Examples:
> > translate_pid(pid, ns, -1)      - get pid in our pid namespace
> > translate_pid(pid, -1, ns)      - get pid in other pid namespace
> > translate_pid(1, ns, -1)        - get pid of init task for namespace
> > translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
> > translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
> > translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
> > translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?
> >
> > Error codes:
> > EBADF    - file descriptor is closed
> > EINVAL   - file descriptor isn't pid-namespace
> > ESRCH    - task not found in @source namespace
>
> Presumably a manpage is planned?
>
> This changelog doesn't explain what the value is to our users.  I
> assume it is a performance optimization because "backward translation
> requires scanning all tasks"?  If so, please show us real-world
> examples of the performance benefit from this patch, and please go to
> great lengths to explain to us why this optimisation is needed by our
> users.

One of the use cases for Oracle Database involves multiple levels of
nested pid namespaces, and we require pid translation between the
levels. The particular use case, and why none of the existing
methods was usable, was discussed in the following thread:

https://patchwork.kernel.org/patch/10276785/

In the end, it was agreed that this patch, along with flocks, will
solve the issue.

Thanks,
Nagarathnam.
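
For context on the "scanning all tasks" remark in the quoted changelog: the forward direction is a single read of the `NSpid:` line in /proc/[pid]/status, which lists the task's pid in each nested namespace from the reader's namespace inwards. Below is a minimal sketch of parsing one such line, assuming the usual tab-separated format; `parse_nspid` is a name made up for this illustration. Going backwards (inner pid to outer pid) would mean running this over the status file of every task until one matches.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Parse a line such as "NSpid:\t1234\t56\t1" into pids[0..max-1].
 * pids[0] is the pid in the reader's namespace, the last entry is the pid
 * in the task's innermost namespace.
 * Returns the number of entries stored, or -1 if this is not an NSpid line.
 */
int parse_nspid(const char *line, long *pids, int max)
{
	static const char key[] = "NSpid:";
	const char *p;
	int n = 0;

	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return -1;

	p = line + sizeof(key) - 1;
	while (n < max) {
		char *end;
		long value = strtol(p, &end, 10);	/* skips leading tabs/spaces */

		if (end == p)		/* no more numbers on the line */
			break;
		pids[n++] = value;
		p = end;
	}
	return n;
}
```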

Re: [RESEND PATCH V4] pidns: introduce syscall translate_pid

2018-04-03 Thread Andrew Morton

On Mon,  2 Apr 2018 15:57:29 -0600 nagarathnam.muthus...@oracle.com wrote:

> pid_t translate_pid(pid_t pid, int source, int target);
>
> This syscall converts a pid from the source pid-ns into a pid in the
> target pid-ns. If the pid is unreachable from the target pid-ns, it
> returns zero.
>
> Pid namespaces are referred to by file descriptors opened on the proc
> files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative
> argument refers to the current pid namespace, same as the file
> /proc/self/ns/pid.
>
> The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but
> backward translation requires scanning all tasks. Pids can also be
> translated by sending them through a unix socket between namespaces,
> but this method is slow and insecure because the other side is exposed
> inside the pid namespace.
>
> Examples:
> translate_pid(pid, ns, -1)      - get pid in our pid namespace
> translate_pid(pid, -1, ns)      - get pid in other pid namespace
> translate_pid(1, ns, -1)        - get pid of init task for namespace
> translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
> translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
> translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
> translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?
>
> Error codes:
> EBADF    - file descriptor is closed
> EINVAL   - file descriptor isn't pid-namespace
> ESRCH    - task not found in @source namespace

Presumably a manpage is planned?

This changelog doesn't explain what the value is to our users.  I
assume it is a performance optimization because "backward translation
requires scanning all tasks"?  If so, please show us real-world
examples of the performance benefit from this patch, and please go to
great lengths to explain to us why this optimisation is needed by our
users.
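
The semantics in the quoted examples can be checked against a small userspace model. This is a sketch only: it is not the kernel implementation, all `model_*` and `ex_*` names are invented here, and the treatment of non-positive pids is an assumption. Each task carries one pid per namespace level, and a namespace sees a task only if it is the task's own namespace or one of its ancestors.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 4

struct model_ns {
	int level;			/* 0 for the root namespace */
	const struct model_ns *parent;	/* NULL for the root namespace */
};

struct model_task {
	const struct model_ns *ns;	/* innermost namespace of the task */
	int nr[MODEL_MAX_LEVEL];	/* nr[l]: pid as seen at level l */
};

/* Pid of task as seen from ns, or 0 if ns cannot see the task. */
static int model_pid_nr(const struct model_task *task, const struct model_ns *ns)
{
	const struct model_ns *cur = task->ns;

	while (cur != NULL && cur->level > ns->level)
		cur = cur->parent;
	return cur == ns ? task->nr[ns->level] : 0;
}

/* Model of translate_pid(): 0 if unreachable, -ESRCH if not found in source. */
int model_translate_pid(const struct model_task *tasks, size_t count, int pid,
			const struct model_ns *source,
			const struct model_ns *target)
{
	size_t i;

	if (pid <= 0)			/* assumption: only positive pids match */
		return -ESRCH;
	for (i = 0; i < count; i++) {
		if (model_pid_nr(&tasks[i], source) == pid)
			return model_pid_nr(&tasks[i], target);
	}
	return -ESRCH;
}

/* Example hierarchy: the root namespace contains two sibling namespaces. */
static const struct model_ns ex_root = { 0, NULL };
static const struct model_ns ex_ns_a = { 1, &ex_root };
static const struct model_ns ex_ns_b = { 1, &ex_root };

static const struct model_task ex_tasks[] = {
	{ &ex_root, { 1 } },		/* host init */
	{ &ex_ns_a, { 1000, 1 } },	/* init of ns_a, pid 1000 on the host */
	{ &ex_ns_a, { 1001, 2 } },	/* worker in ns_a */
	{ &ex_ns_b, { 2000, 1 } },	/* init of ns_b */
};
```

In this model, translating pid 1 from `ex_ns_a` to `ex_root` yields 1000 (the init task of that namespace), translating it to `ex_ns_b` yields 0 (ns_a is outside ns_b), and translating it to `ex_ns_a` itself yields 1.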

[RESEND PATCH V4] pidns: introduce syscall translate_pid

2018-04-02 Thread nagarathnam . muthusamy

pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts a pid from the source pid-ns into a pid in the
target pid-ns. If the pid is unreachable from the target pid-ns, it
returns zero.

Pid namespaces are referred to by file descriptors opened on the proc
files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative
argument refers to the current pid namespace, same as the file
/proc/self/ns/pid.

The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but
backward translation requires scanning all tasks. Pids can also be
translated by sending them through a unix socket between namespaces,
but this method is slow and insecure because the other side is exposed
inside the pid namespace.

Examples:
translate_pid(pid, ns, -1)      - get pid in our pid namespace
translate_pid(pid, -1, ns)      - get pid in other pid namespace
translate_pid(1, ns, -1)        - get pid of init task for namespace
translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?

Error codes:
EBADF    - file descriptor is closed
EINVAL   - file descriptor isn't pid-namespace
ESRCH    - task not found in @source namespace

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Nagarathnam Muthusamy 
---

v1: https://lkml.org/lkml/2015/9/15/411
v2: https://lkml.org/lkml/2015/9/24/278
 * use namespace-fd as second/third argument
 * add -pid for getting parent pid
 * move code into kernel/sys.c next to getppid
 * drop ifdef CONFIG_PID_NS
 * add generic syscall
v3: https://lkml.org/lkml/2015/9/28/3
 * use proc_ns_fdget()
 * update description
 * rebase to next-20150925
 * fix conflict with mlock2
v4:
 * rename into translate_pid()
 * remove syscall if CONFIG_PID_NS=n
 * drop -pid for parent task
 * drop fget-fdget optimizations
 * add helper get_pid_ns_by_fd()
 * wire only into x86
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h               |  1 +
 kernel/pid_namespace.c                 | 66 ++
 kernel/sys_ni.c                        |  3 ++
 5 files changed, 72 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac21..257d839 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382	i386	pkey_free		sys_pkey_free
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
+385	i386	translate_pid		sys_translate_pid
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183..1ebdab8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	common	translate_pid		sys_translate_pid
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a78186d..6467ebc 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -901,6 +901,7 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd,
  struct file_handle __user *handle,
  int flags);
 asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_translate_pid(pid_t pid, int source, int target);
 asmlinkage long sys_process_vm_readv(pid_t pid,
 const struct iovec __user *lvec,
 unsigned long liovcnt,
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 773b2b3..bb56a78 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -380,6 +381,71 @@ static void pidns_put(struct ns_common *ns)
 	put_pid_ns(to_pid_ns(ns));
 }
 
+static struct pid_namespace *get_pid_ns_by_fd(int fd)
+{
+	struct pid_namespace *pidns;
+	struct ns_common *ns;
+	struct file *file;
+
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return ERR_CAST(file);
+
+	ns = get_proc_ns(file_inode(file));
+	if (ns->ops->type == CLONE_NEWPID)
+		pidns = get_pid_ns(to_pid_ns(ns));
+	else
+		pidns = ERR_PTR(-EINVAL);
+
+	fput(file);
+	return pidns;
+}
+
+/*
+ * translate_pid - convert pid in source pid-ns into target pid-ns.
+ * @pid:    pid for translation
+ * @source: pid-ns file descriptor or -1 for active namespace
+ * @target: pid-ns file descriptor or -1 for active namespace
+ *
+ * Returns pid in

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-11-01 Thread prakash.sangappa




On 11/01/2017 10:43 AM, Jann Horn wrote:

On Tue, Oct 17, 2017 at 5:38 PM, Prakash Sangappa wrote:

On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa wrote:

On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:

On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts a pid from the source pid-ns into a pid in the
target pid-ns. If the pid is unreachable from the target pid-ns, it
returns zero.

Pid namespaces are referred to by file descriptors opened on the proc
files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative
argument refers to the current pid namespace, same as the file
/proc/self/ns/pid.

The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but
backward translation requires scanning all tasks. Pids can also be
translated by sending them through a unix socket between namespaces,
but this method is slow and insecure because the other side is exposed
inside the pid namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid namespaces, for example to identify a process in a container by its
pid file when looking from outside.

Two years ago I solved this in a project of mine with monstrous code
that forks a couple of times just to convert a pid. Luckily for me,
performance wasn't important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in mail)
can provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces
and needs a direct method of converting pids of processes in the pid
namespace hierarchy. In this use case multiple nested PID namespaces
will be used. The currently available mechanisms are not very efficient
for this use case. For example, as Konstantin described, using
/proc/[pid]/status would require the application to scan the status
files of all pids to determine the pid of a given process in a child
namespace.

Use of the SCM_CREDENTIALS socket message is another way, which would
require every process starting inside a pid namespace to send this
message, and the receiving process in the target namespace would have to
save the converted pid and reference it. This mechanism becomes
cumbersome, especially if the application has to deal with multiple
nested pid namespaces. Also, the Database needs to be able to convert a
thread's global pid (gettid()). Passing the thread's pid (gettid()) in an
SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing,
will work best for the Database, since the pid of a process in any of
the nested pid namespaces can be converted as and when required. I think
with the proposed API, the application should be able to convert the pid
of a process or the tid (gettid()) of a thread as well.


Can you explain what Oracle's database is planning to do with this
information?


The Database uses the PID to programmatically find out whether the
process/thread is alive (kill 0), to send signals to processes requesting
that they dump status/debug information, and to kill the processes in case
of a shutdown abort of the instance.

But if kill(pid, 0) returns 0, that doesn't tell you anything, right?
It could be that the process you're trying to check is still alive, but
it could also be that it has died, ns_last_pid has wrapped around, and
the PID is now being reused by another process, right?


That is true. The Database checks the process start time by reading the
/proc/[pid]/stat file to verify that it is the correct process.
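
As an illustration of the start-time check mentioned here, the sketch below extracts the start time from one line of /proc/[pid]/stat. It assumes the layout documented in proc(5), where the start time is field 22 and the command name in field 2 is wrapped in parentheses and may itself contain spaces or parentheses; `parse_starttime` is a name invented for this example and is not taken from the Database code.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract field 22 (start time, in clock ticks since boot) from the
 * contents of /proc/[pid]/stat. Returns 0 on success, -1 on malformed input.
 */
int parse_starttime(const char *stat_line, unsigned long long *starttime)
{
	const char *p = strrchr(stat_line, ')');	/* end of the comm field */
	char *end;
	unsigned long long value;
	int field;

	if (p == NULL)
		return -1;
	p++;

	/* Skip fields 3 through 21; the comm field was field 2. */
	for (field = 3; field < 22; field++) {
		while (*p == ' ')
			p++;
		if (*p == '\0')
			return -1;
		while (*p != ' ' && *p != '\0')
			p++;
	}
	while (*p == ' ')
		p++;
	if (*p < '0' || *p > '9')
		return -1;

	errno = 0;
	value = strtoull(p, &end, 10);
	if (errno != 0 || end == p)
		return -1;
	*starttime = value;
	return 0;
}
```

A monitor would record this value together with the pid and treat a later mismatch as "pid was reused", which narrows but does not fully close the race discussed in the thread.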



Wouldn't it be more reliable to open("/proc/self", O_RDONLY)
(or /proc/thread-self) in the process you want to monitor, then send
the resulting file descriptor to the monitoring process with SCM_RIGHTS?
Then something like this should work for checking whether the process
is still alive without relying on PIDs at all:

 int retval = faccessat(child_proc_self_fd, "stat", F_OK, 0);
 if (retval == 0) {
   /* process still exists */
 } else if (retval == -1 && errno == ESRCH) {
   /* process is gone */
 } else {
   err(1, "unexpected faccessat result");
 }


Yes, but there will be a large number of processes to deal with and only
a few processes monitoring them. All these processes would have to open
/proc/self and send the fd to all the monitoring processes. In the
database case, there is one fixed monitoring process, but other
monitoring processes can exit and new ones can be started.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-11-01 Thread Jann Horn

On Tue, Oct 17, 2017 at 5:38 PM, Prakash Sangappa
 wrote:
>
>
> On 10/16/17 5:52 PM, Andy Lutomirski wrote:
>>
>> On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
>>  wrote:
>>>
>>>
>>> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



 On 10/16/2017 02:36 PM, Andrew Morton wrote:
>
> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
>  wrote:
>
> pid_t translate_pid(pid_t pid, int source, int target);
>
> This syscall converts pid from source pid-ns into pid in target
> pid-ns.
> If pid is unreachable from target pid-ns it returns zero.
>
> Pid-namespaces are referred file descriptors opened to proc files
> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
> argument
> refers to current pid namespace, same as file /proc/self/ns/pid.
>
> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
> backward
> translation requires scanning all tasks. Also pids could be
> translated
> by sending them through unix socket between namespaces, this method
> is
> slow and insecure because other side is exposed inside pid
> namespace.
>>
>> Andrew asked why we might need this.
>>
>> Such conversion is required for interaction between processes across
>> pid-namespaces.
>> For example to identify process in container by pid file looking from
>> outside.
>>
>> Two years ago I've solved this in project of mine with monstrous code
>> which
>> forks couple times just to convert pid, lucky for me performance
>> wasn't
>> important.
>
> That's a single user who needed this a single time, and found a
> userspace-based solution anyway.  This is not exactly compelling!
>
> Is there a stronger case to be made?  How does this change benefit our
> users?  Sell it to us!

 Oracle database is planning to use pid namespace for sandboxing database
 instances and they need an API similar to translate_pid to effectively
 translate process IDs from other pid namespaces. Prakash (cced in mail)
 can
 provide more details on this usecase.
>>>
>>>
>>> As Nagarathnam indicated, Oracle Database will be using pid namespaces
>>> and
>>> needs a direct method of converting pids of processes in the pid
>>> namespace
>>> hierarchy. In this use case multiple
>>> nested PID namespaces will be used.  The currently available mechanism
>>> are
>>> not very efficient for this use case. For ex. as Konstantin described,
>>> using
>>> /proc//status would require the application to scan all the pid's
>>> status files to determine the pid of given process in a child namespace.
>>>
>>> Use of SCM_CREDENTIALS's socket message is another way, which would
>>> require
>>> every process starting inside a pid namespace to send this message and
>>> the
>>> receiving process in the target namespace would have to save the
>>> converted
>>> pid and reference it. This mechanism becomes cumbersome especially if the
>>> application has to deal with multiple nested pid namespaces. Also, the
>>> Database needs to be able to convert a thread's global pid(gettid()).
>>> Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
>>> CAP_SYS_ADMIN, which is an issue.
>>>
>>> So having a direct method, like the API that Konstantin is proposing,
>>> will
>>> work best for the Database
>>> since pid of a process in any of the nested pid namespaces can be
>>> converted
>>> as and when required. I think with the proposed API, the application
>>> should
>>> be able to convert pid of a process or tid(gettid()) of a thread as well.
>>>
>>
>> Can you explain what Oracle's database is planning to do with this
>> information?
>
>
> Database uses the PID to programmatically find out if the process/thread is
> alive(kill 0) also send signals to the processes requesting it to dump
> status/debug information and kill the processes in case of a shutdown abort
> of the instance.

But if kill(pid, 0) returns 0, that doesn't tell you anything, right?
It could be that
the process you're trying to check is still alive, but it could also
be that it has
died, ns_last_pid has wrapped around, and the PID is now being reused by
another process, right?
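
To make the ambiguity concrete, here is a minimal sketch of the kill(pid, 0) probe being discussed. The helper name pid_in_use() is hypothetical and not taken from any code in this thread. The probe can only report that some process (possibly an unreaped zombie, possibly an unrelated process after PID reuse) currently owns the number, which is why a separate identity check such as the start time is still needed.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: probe a PID with the null signal.
 * Returns 1 if some process currently owns `pid`, 0 if none does,
 * and -1 on an unexpected error. A return of 1 does NOT prove it is
 * the process the caller originally recorded. */
int pid_in_use(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;
    if (errno == EPERM)   /* a process exists, but we may not signal it */
        return 1;
    if (errno == ESRCH)   /* no process with this PID */
        return 0;
    return -1;
}
```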

Wouldn't it be more reliable to open("/proc/self", O_RDONLY)
(or /proc/thread-self) in the process you want to monitor, then send
the resulting file descriptor to the monitoring process with SCM_RIGHTS?
Then something like this should work for checking whether the process
is still alive without relying on PIDs at all:

int retval = faccessat(child_proc_self_fd, "stat", F_OK, 0);
if (retval == 0) {
  /* process still exists */
} else if (retval == -1 && errno == ESRCH) {
  /* process is gone */
} else {
  err(1, "unexpected faccessat result");
}

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-11-01 Thread nagarathnam muthusamy

I believe all the questions raised in this thread were answered. Just 
wondering if there are any outstanding questions?


Thanks,
Nagarathnam.
On 10/17/2017 3:53 PM, prakash sangappa wrote:


On 10/17/2017 3:40 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa
 wrote:

On 10/17/2017 3:02 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
 wrote:


On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
 wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
 wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in 
target

pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to 
proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. 
Negative

argument
refers to current pid namespace, same as file 
/proc/self/ns/pid.


Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this
method
is
slow and insecure because other side is exposed inside pid
namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes 
across

pid-namespaces.
For example to identify process in container by pid file looking
from
outside.

Two years ago I've solved this in project of mine with monstrous
code
which
forks couple times just to convert pid, lucky for me performance
wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change 
benefit

our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing
database
instances and they need an API similar to translate_pid to 
effectively

translate process IDs from other pid namespaces. Prakash (cced in
mail)
can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid 
namespaces

and
needs a direct method of converting pids of processes in the pid
namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available 
mechanism

are
not very efficient for this use case. For ex. as Konstantin 
described,

using
/proc//status would require the application to scan all the 
pid's

status files to determine the pid of given process in a child
namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would
require
every process starting inside a pid namespace to send this 
message and

the
receiving process in the target namespace would have to save the
converted
pid and reference it. This mechanism becomes cumbersome 
especially if

the
application has to deal with multiple nested pid namespaces. 
Also, the
Database needs to be able to convert a thread's global 
pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message 
requires

CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is 
proposing,

will
work best for the Database
since pid of a process in any of the nested pid namespaces can be
converted
as and when required. I think with the proposed API, the 
application

should
be able to convert pid of a process or tid(gettid()) of a thread as
well.


Can you explain what Oracle's database is planning to do with this
information?


Database uses the PID to programmatically find out if the 
process/thread

is
alive(kill 0) also send signals to the processes requesting it to 
dump

status/debug information and kill the processes in case of a shutdown
abort
of the instance.

What I'm wondering is: how does the caller of kill() end up
controlling a task whose pid it doesn't know in its own namespace?


I was generally describing how DB would use the PID of process. The
above description was in the case when no namespaces are used.

With use of namespaces, the DB would convert the PID of processes
inside its children namespaces to PID in its namespace and use that
pid to issue kill().

Seems vaguely sensible.

If I were designing this type of system, I'd have a manager process in
each namespace running as PID 1, though -- PID 1 is special and needs
to understand what's going on anyway.  Then PID 1 would do the kill()
calls and wouldn't need translate_pid().


Yes, this has been tried out with the prototype use of PID namespaces
in the DB.
It works, but would be slow as the manager would have to exchange
messages with the controlling processes which would be in the parent
namespace.
DB could use the api to convert the pid.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread prakash sangappa



On 10/17/2017 3:40 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa
 wrote:

On 10/17/2017 3:02 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
 wrote:


On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
 wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
 wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target
pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this
method
is
slow and insecure because other side is exposed inside pid
namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid-namespaces.
For example to identify process in container by pid file looking
from
outside.

Two years ago I've solved this in project of mine with monstrous
code
which
forks couple times just to convert pid, lucky for me performance
wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit
our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing
database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in
mail)
can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces
and
needs a direct method of converting pids of processes in the pid
namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanism
are
not very efficient for this use case. For ex. as Konstantin described,
using
/proc//status would require the application to scan all the pid's
status files to determine the pid of given process in a child
namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would
require
every process starting inside a pid namespace to send this message and
the
receiving process in the target namespace would have to save the
converted
pid and reference it. This mechanism becomes cumbersome especially if
the
application has to deal with multiple nested pid namespaces. Also, the
Database needs to be able to convert a thread's global pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing,
will
work best for the Database
since pid of a process in any of the nested pid namespaces can be
converted
as and when required. I think with the proposed API, the application
should
be able to convert pid of a process or tid(gettid()) of a thread as
well.


Can you explain what Oracle's database is planning to do with this
information?


Database uses the PID to programmatically find out if the process/thread
is
alive(kill 0) also send signals to the processes requesting it to dump
status/debug information and kill the processes in case of a shutdown
abort
of the instance.

What I'm wondering is: how does the caller of kill() end up
controlling a task whose pid it doesn't know in its own namespace?


I was generally describing how DB would use the PID of process. The above
description
was in the case when no namespaces are used.

With use of namespaces, the DB would convert the PID of processes inside
its children namespaces to PID in its namespace and use that pid to issue
kill().

Seems vaguely sensible.

If I were designing this type of system, I'd have a manager process in
each namespace running as PID 1, though -- PID 1 is special and needs
to understand what's going on anyway.  Then PID 1 would do the kill()
calls and wouldn't need translate_pid().


Yes, this has been tried out with the prototype use of PID namespaces
in the DB.
It works, but would be slow as the manager would have to exchange
messages with the controlling processes which would be in the parent
namespace.
DB could use the api to convert the pid.
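
For comparison, the existing forward lookup through /proc/[pid]/status:NSpid mentioned in the quoted changelog can be sketched as below. The function names parse_nspid() and read_nspid() are made up for illustration, and the sketch assumes a kernel that exposes the NSpid line. It only answers "which pids does this task have in nested namespaces"; going the other way still means scanning every task's status file, which is the cost translate_pid is meant to avoid.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Parse the "NSpid:" line from the text of a /proc/[pid]/status file.
 * Stores up to `max` pids in `out`, in the order they appear on the
 * line, and returns the number stored, or -1 if there is no NSpid line. */
int parse_nspid(const char *status, pid_t *out, int max)
{
    assert(status != NULL && out != NULL);

    const char *line = status;
    while (line && *line) {
        if (strncmp(line, "NSpid:", 6) == 0) {
            const char *p = line + 6;
            int n = 0;
            while (n < max) {
                while (*p == ' ' || *p == '\t')
                    p++;
                if (*p == '\n' || *p == '\0')
                    break;               /* end of the NSpid line */
                char *end;
                long v = strtol(p, &end, 10);
                if (end == p)
                    break;               /* not a number, stop parsing */
                out[n++] = (pid_t)v;
                p = end;
            }
            return n;
        }
        line = strchr(line, '\n');       /* advance to the next line */
        if (line)
            line++;
    }
    return -1;
}

/* Read /proc/<pid>/status and extract the NSpid list.
 * Returns the number of pids stored, or -1 on any failure. */
int read_nspid(pid_t pid, pid_t *out, int max)
{
    char path[64];
    char buf[8192];

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t len = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[len] = '\0';
    return parse_nspid(buf, out, max);
}
```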

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread prakash sangappa



On 10/17/2017 3:40 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa
 wrote:

On 10/17/2017 3:02 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
 wrote:


On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
 wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
 wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target
pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this
method
is
slow and insecure because other side is exposed inside pid
namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid-namespaces.
For example to identify process in container by pid file looking
from
outside.

Two years ago I've solved this in project of mine with monstrous
code
which
forks couple times just to convert pid, lucky for me performance
wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit
our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing
database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in
mail)
can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces
and
needs a direct method of converting pids of processes in the pid
namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanism
are
not very efficient for this use case. For ex. as Konstantin described,
using
/proc//status would require the application to scan all the pid's
status files to determine the pid of given process in a child
namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would
require
every process starting inside a pid namespace to send this message and
the
receiving process in the target namespace would have to save the
converted
pid and reference it. This mechanism becomes cumbersome especially if
the
application has to deal with multiple nested pid namespaces. Also, the
Database needs to be able to convert a thread's global pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing,
will
work best for the Database
since pid of a process in any of the nested pid namespaces can be
converted
as and when required. I think with the proposed API, the application
should
be able to convert pid of a process or tid(gettid()) of a thread as
well.


Can you explain what Oracle's database is planning to do with this
information?


Database uses the PID to programmatically find out if the process/thread
is
alive(kill 0) also send signals to the processes requesting it to dump
status/debug information and kill the processes in case of a shutdown
abort
of the instance.

What I'm wondering is: how does the caller of kill() end up
controlling a task whose pid it doesn't know in its own namespace?


I was generally describing how DB would use the PID of process. The above
description
was in the case when no namespaces are used.

With use of namespaces, the DB would convert the PID of processes inside
its children namespaces to PID in its namespace and use that pid to issue
kill().

Seems vaguely sensible.

If I were designing this type of system, I'd have a manager process in
each namespace running as PID 1, though -- PID 1 is special and needs
to understand what's going on anyway.  Then PID 1 would do the kill()
calls and wouldn't need translate_pid().


Yes, this has been tried out with the prototype use of PID namespaces in
the DB. It works, but would be slow as the manager would have to exchange
messages with the controlling processes which would be in the parent
namespace. DB could use the API to convert the pid.
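
For reference, the SCM_CREDENTIALS route discussed above can be sketched
roughly as below. This is only an illustration under assumptions: the helper
names expect_creds(), send_own_creds() and recv_peer_pid() are made up for
this sketch and come neither from the patch nor from the Database code. The
sender attaches its own pid to a unix socket message, and the receiver reads
the pid as the kernel reports it in the receiver's pid namespace. Only the
sender's own pid can be passed without CAP_SYS_ADMIN, so every process has to
do the send itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* struct ucred, SCM_CREDENTIALS */
#endif
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for one struct ucred, suitably aligned. */
union cred_buf {
	char buf[CMSG_SPACE(sizeof(struct ucred))];
	struct cmsghdr align;
};

/* Receiver side: ask the kernel to deliver sender credentials. */
int expect_creds(int sock)
{
	int on = 1;

	assert(sock >= 0);
	return setsockopt(sock, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on));
}

/* Sender side: send one byte carrying the caller's own pid/uid/gid. */
int send_own_creds(int sock)
{
	struct ucred cred = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
	char byte = 'x';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union cred_buf u;
	struct msghdr msg = { 0 };
	struct cmsghdr *c;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	c = CMSG_FIRSTHDR(&msg);
	c->cmsg_level = SOL_SOCKET;
	c->cmsg_type = SCM_CREDENTIALS;
	c->cmsg_len = CMSG_LEN(sizeof(cred));
	memcpy(CMSG_DATA(c), &cred, sizeof(cred));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receiver side: return the sender's pid as seen by the receiver, or -1. */
pid_t recv_peer_pid(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union cred_buf u;
	struct msghdr msg = { 0 };
	struct cmsghdr *c;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) < 0)
		return -1;

	for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
		if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
			struct ucred cred;

			memcpy(&cred, CMSG_DATA(c), sizeof(cred));
			return cred.pid;
		}
	}
	return -1;   /* no credentials attached */
}
```

With both ends in the same pid namespace the receiver simply sees the
sender's getpid() value. The receiver also has to remember every converted
pid itself, which is the bookkeeping described above as cumbersome.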

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread Andy Lutomirski

On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa
 wrote:
>
> On 10/17/2017 3:02 PM, Andy Lutomirski wrote:
>>
>> On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
>>  wrote:
>>>
>>>
>>> On 10/16/17 5:52 PM, Andy Lutomirski wrote:

 On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
  wrote:
>
>
> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:
>>
>>
>>
>> On 10/16/2017 02:36 PM, Andrew Morton wrote:
>>>
>>> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
>>>  wrote:
>>>
>>> pid_t translate_pid(pid_t pid, int source, int target);
>>>
>>> This syscall converts pid from source pid-ns into pid in target
>>> pid-ns.
>>> If pid is unreachable from target pid-ns it returns zero.
>>>
>>> Pid-namespaces are referred file descriptors opened to proc files
>>> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
>>> argument
>>> refers to current pid namespace, same as file /proc/self/ns/pid.
>>>
>>> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
>>> backward
>>> translation requires scanning all tasks. Also pids could be
>>> translated
>>> by sending them through unix socket between namespaces, this
>>> method
>>> is
>>> slow and insecure because other side is exposed inside pid
>>> namespace.

 Andrew asked why we might need this.

 Such conversion is required for interaction between processes across
 pid-namespaces.
 For example to identify process in container by pid file looking
 from
 outside.

 Two years ago I've solved this in project of mine with monstrous
 code
 which
 forks couple times just to convert pid, lucky for me performance
 wasn't
 important.
>>>
>>> That's a single user who needed this a single time, and found a
>>> userspace-based solution anyway.  This is not exactly compelling!
>>>
>>> Is there a stronger case to be made?  How does this change benefit
>>> our
>>> users?  Sell it to us!
>>
>> Oracle database is planning to use pid namespace for sandboxing
>> database
>> instances and they need an API similar to translate_pid to effectively
>> translate process IDs from other pid namespaces. Prakash (cced in
>> mail)
>> can
>> provide more details on this usecase.
>
>
> As Nagarathnam indicated, Oracle Database will be using pid namespaces
> and
> needs a direct method of converting pids of processes in the pid
> namespace
> hierarchy. In this use case multiple
> nested PID namespaces will be used.  The currently available mechanism
> are
> not very efficient for this use case. For ex. as Konstantin described,
> using
> /proc//status would require the application to scan all the pid's
> status files to determine the pid of given process in a child
> namespace.
>
> Use of SCM_CREDENTIALS's socket message is another way, which would
> require
> every process starting inside a pid namespace to send this message and
> the
> receiving process in the target namespace would have to save the
> converted
> pid and reference it. This mechanism becomes cumbersome especially if
> the
> application has to deal with multiple nested pid namespaces. Also, the
> Database needs to be able to convert a thread's global pid(gettid()).
> Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
> CAP_SYS_ADMIN, which is an issue.
>
> So having a direct method, like the API that Konstantin is proposing,
> will
> work best for the Database
> since pid of a process in any of the nested pid namespaces can be
> converted
> as and when required. I think with the proposed API, the application
> should
> be able to convert pid of a process or tid(gettid()) of a thread as
> well.
>
 Can you explain what Oracle's database is planning to do with this
 information?
>>>
>>>
>>> Database uses the PID to programmatically find out if the process/thread
>>> is
>>> alive(kill 0) also send signals to the processes requesting it to dump
>>> status/debug information and kill the processes in case of a shutdown
>>> abort
>>> of the instance.
>>
>> What I'm wondering is: how does the caller of kill() end up
>> controlling a task whose pid it doesn't know in its own namespace?
>
>
> I was generally describing how DB would use the PID of process. The above
> description
> was in the case when no namespaces are used.
>
> With use of namespaces, the DB would convert the PID of processes inside
> its children namespaces to PID in its namespace and use that pid to issue
> kill().

Seems vaguely sensible.

If I were designing this type of system, I'd

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread prakash sangappa



On 10/17/2017 3:02 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
 wrote:


On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
 wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
 wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target
pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this method
is
slow and insecure because other side is exposed inside pid
namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid-namespaces.
For example to identify process in container by pid file looking from
outside.

Two years ago I've solved this in project of mine with monstrous code
which
forks couple times just to convert pid, lucky for me performance
wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in mail)
can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces
and
needs a direct method of converting pids of processes in the pid
namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanism
are
not very efficient for this use case. For ex. as Konstantin described,
using
/proc//status would require the application to scan all the pid's
status files to determine the pid of given process in a child namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would
require
every process starting inside a pid namespace to send this message and
the
receiving process in the target namespace would have to save the
converted
pid and reference it. This mechanism becomes cumbersome especially if the
application has to deal with multiple nested pid namespaces. Also, the
Database needs to be able to convert a thread's global pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing,
will
work best for the Database
since pid of a process in any of the nested pid namespaces can be
converted
as and when required. I think with the proposed API, the application
should
be able to convert pid of a process or tid(gettid()) of a thread as well.


Can you explain what Oracle's database is planning to do with this
information?


Database uses the PID to programmatically find out if the process/thread is
alive(kill 0) also send signals to the processes requesting it to dump
status/debug information and kill the processes in case of a shutdown abort
of the instance.

What I'm wondering is: how does the caller of kill() end up
controlling a task whose pid it doesn't know in its own namespace?


I was generally describing how DB would use the PID of process. The above
description was in the case when no namespaces are used.

With use of namespaces, the DB would convert the PID of processes inside
its children namespaces to PID in its namespace and use that pid to issue
kill().


-Prakash.




-Prakash.
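
To make the /proc/[pid]/status approach concrete: the NSpid line lists the
task's id in each pid namespace it belongs to, starting with the namespace of
the procfs mount and continuing into successively nested namespaces. A minimal
sketch of parsing that line follows; the function name parse_nspid_line() is
made up for illustration. Going from an outer pid to an inner one is a single
file read, but the reverse direction means reading this line for every task
until the inner value matches, which is the scan described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Parse one "NSpid:" line as found in /proc/[pid]/status.
 * Stores up to max ids in out[], outermost namespace first, and returns
 * how many were stored, or -1 if the line is not an NSpid line.
 */
int parse_nspid_line(const char *line, pid_t *out, int max)
{
	const char *p;
	int n = 0;

	assert(line != NULL && out != NULL);

	if (strncmp(line, "NSpid:", 6) != 0)
		return -1;

	p = line + 6;
	while (n < max) {
		char *end;
		long v = strtol(p, &end, 10);   /* skips leading tabs/spaces */

		if (end == p)                   /* no further number on the line */
			break;
		out[n++] = (pid_t)v;
		p = end;
	}
	return n;
}
```

For a task two namespaces deep the line carries three values, and the last
one is the pid the task sees for itself.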

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread Andy Lutomirski

On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
 wrote:
>
>
> On 10/16/17 5:52 PM, Andy Lutomirski wrote:
>>
>> On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
>>  wrote:
>>>
>>>
>>> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



 On 10/16/2017 02:36 PM, Andrew Morton wrote:
>
> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
>  wrote:
>
> pid_t translate_pid(pid_t pid, int source, int target);
>
> This syscall converts pid from source pid-ns into pid in target
> pid-ns.
> If pid is unreachable from target pid-ns it returns zero.
>
> Pid-namespaces are referred file descriptors opened to proc files
> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
> argument
> refers to current pid namespace, same as file /proc/self/ns/pid.
>
> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
> backward
> translation requires scanning all tasks. Also pids could be
> translated
> by sending them through unix socket between namespaces, this method
> is
> slow and insecure because other side is exposed inside pid
> namespace.
>>
>> Andrew asked why we might need this.
>>
>> Such conversion is required for interaction between processes across
>> pid-namespaces.
>> For example to identify process in container by pid file looking from
>> outside.
>>
>> Two years ago I've solved this in project of mine with monstrous code
>> which
>> forks couple times just to convert pid, lucky for me performance
>> wasn't
>> important.
>
> That's a single user who needed this a single time, and found a
> userspace-based solution anyway.  This is not exactly compelling!
>
> Is there a stronger case to be made?  How does this change benefit our
> users?  Sell it to us!

 Oracle database is planning to use pid namespace for sandboxing database
 instances and they need an API similar to translate_pid to effectively
 translate process IDs from other pid namespaces. Prakash (cced in mail)
 can
 provide more details on this usecase.
>>>
>>>
>>> As Nagarathnam indicated, Oracle Database will be using pid namespaces
>>> and
>>> needs a direct method of converting pids of processes in the pid
>>> namespace
>>> hierarchy. In this use case multiple
>>> nested PID namespaces will be used.  The currently available mechanism
>>> are
>>> not very efficient for this use case. For ex. as Konstantin described,
>>> using
>>> /proc//status would require the application to scan all the pid's
>>> status files to determine the pid of given process in a child namespace.
>>>
>>> Use of SCM_CREDENTIALS's socket message is another way, which would
>>> require
>>> every process starting inside a pid namespace to send this message and
>>> the
>>> receiving process in the target namespace would have to save the
>>> converted
>>> pid and reference it. This mechanism becomes cumbersome especially if the
>>> application has to deal with multiple nested pid namespaces. Also, the
>>> Database needs to be able to convert a thread's global pid(gettid()).
>>> Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
>>> CAP_SYS_ADMIN, which is an issue.
>>>
>>> So having a direct method, like the API that Konstantin is proposing,
>>> will
>>> work best for the Database
>>> since pid of a process in any of the nested pid namespaces can be
>>> converted
>>> as and when required. I think with the proposed API, the application
>>> should
>>> be able to convert pid of a process or tid(gettid()) of a thread as well.
>>>
>>
>> Can you explain what Oracle's database is planning to do with this
>> information?
>
>
> Database uses the PID to programmatically find out if the process/thread is
> alive(kill 0) also send signals to the processes requesting it to dump
> status/debug information and kill the processes in case of a shutdown abort
> of the instance.

What I'm wondering is: how does the caller of kill() end up
controlling a task whose pid it doesn't know in its own namespace?

>
> -Prakash.
>
>

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread Prakash Sangappa




On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
 wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
 wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target
pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this method
is
slow and insecure because other side is exposed inside pid namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid-namespaces.
For example to identify process in container by pid file looking from
outside.

Two years ago I've solved this in project of mine with monstrous code
which
forks couple times just to convert pid, lucky for me performance wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in mail) can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces and
needs a direct method of converting pids of processes in the pid namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanism are
not very efficient for this use case. For ex. as Konstantin described, using
/proc//status would require the application to scan all the pid's
status files to determine the pid of given process in a child namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would require
every process starting inside a pid namespace to send this message and the
receiving process in the target namespace would have to save the converted
pid and reference it. This mechanism becomes cumbersome especially if the
application has to deal with multiple nested pid namespaces. Also, the
Database needs to be able to convert a thread's global pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing, will
work best for the Database
since pid of a process in any of the nested pid namespaces can be converted
as and when required. I think with the proposed API, the application should
be able to convert pid of a process or tid(gettid()) of a thread as well.



Can you explain what Oracle's database is planning to do with this information?


Database uses the PID to programmatically find out if the process/thread 
is alive(kill 0) also send signals to the processes requesting it to 
dump status/debug information and kill the processes in case of a 
shutdown abort of the instance.


-Prakash.
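
A minimal sketch of the liveness check described above, assuming the pid has
already been expressed in the caller's own pid namespace (which is the step
the proposed translate_pid() is meant to provide). The helper name
pid_is_alive() is hypothetical. kill() with signal 0 delivers nothing and only
runs the existence and permission checks.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a process with this pid exists in the caller's pid
 * namespace, 0 if it does not, and -1 on any other error.
 */
int pid_is_alive(pid_t pid)
{
	assert(pid > 0);        /* 0 and negative values address process groups */

	if (kill(pid, 0) == 0)
		return 1;
	if (errno == EPERM)
		return 1;       /* exists, but we are not allowed to signal it */
	if (errno == ESRCH)
		return 0;       /* no such process */
	return -1;
}
```

For example, pid_is_alive(getpid()) returns 1. As with any pid-based check,
the answer can be stale by the time it is used if the pid has been recycled.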

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread Prakash Sangappa




On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
 wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
 wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target
pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this method
is
slow and insecure because other side is exposed inside pid namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid-namespaces.
For example to identify process in container by pid file looking from
outside.

Two years ago I've solved this in project of mine with monstrous code
which
forks couple times just to convert pid, lucky for me performance wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in mail) can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces and
needs a direct method of converting pids of processes in the pid namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanism are
not very efficient for this use case. For ex. as Konstantin described, using
/proc//status would require the application to scan all the pid's
status files to determine the pid of given process in a child namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would require
every process starting inside a pid namespace to send this message and the
receiving process in the target namespace would have to save the converted
pid and reference it. This mechanism becomes cumbersome especially if the
application has to deal with multiple nested pid namespaces. Also, the
Database needs to be able to convert a thread's global pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing, will
work best for the Database
since pid of a process in any of the nested pid namespaces can be converted
as and when required. I think with the proposed API, the application should
be able to convert pid of a process or tid(gettid()) of a thread as well.



Can you explain what Oracle's database is planning to do with this information?


The Database uses the PID to programmatically find out whether a
process/thread is alive (kill 0). It also sends signals to processes to
request that they dump status/debug information, and it kills the
processes in case of a shutdown abort of the instance.
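
For reference, the "kill 0" liveness probe mentioned above can be sketched as follows. This is only an illustration, not the Database's code; the helper name pid_is_alive is made up. Note that the pid passed in must already be valid in the caller's pid namespace, which is exactly where a translation step is needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: true if a process with this pid exists in the
 * caller's pid namespace. Signal 0 runs only the existence and
 * permission checks; no signal is delivered.
 */
static bool pid_is_alive(pid_t pid)
{
	if (pid <= 0)
		return false;	/* 0 and negative values address groups */
	if (kill(pid, 0) == 0)
		return true;
	/* EPERM: the process exists but we may not signal it. */
	return errno == EPERM;
}
```

A zombie that has not been reaped yet still counts as existing for this check, and the pid may be reused between the probe and any later signal.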


-Prakash.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread Konstantin Khlebnikov


On 17.10.2017 00:05, Nagarathnam Muthusamy wrote:



On 10/16/2017 09:24 AM, Oleg Nesterov wrote:

On 10/13, Konstantin Khlebnikov wrote:


On 13.10.2017 19:05, Oleg Nesterov wrote:

I won't insist, but this suggests we should add a new helper,
get_ns_by_fd_type(fd, type), and convert get_net_ns_by_fd() to use it
as well.

That was in v3.

I'll prefer to this later, separately. And replace fget with fdget which
allows to do this without atomic operations if task is single-threaded.

OK, agreed,


Stupid question. Can't we make a simpler API which doesn't need /proc/ ?
I mean,

sys_translate_pid(pid_t pid, pid_t source_pid, pid_t target_pid)
{
	struct pid_namespace *source_ns, *target_ns;

	source_ns = task_active_pid_ns(find_task_by_vpid(source_pid));
	target_ns = task_active_pid_ns(find_task_by_vpid(target_pid));

	...
}

Yes, this is more limited... Do you have a use-case when this is not enough?

That was in v1 but considered too racy.

Hmm, I don't understand...

Yes sure, this is racy but open("/proc/$pid/ns/pid") is racy too?

OK, once you do fd=open("/proc/$pid/ns/pid") you can use this fd even after
its owner exits, while find_task_by_vpid() will fail or find another task if
this pid was already reused.

But once again, do you have a use-case when this is important?


I believe that in V1 Eric pointed out that pid in general is not a clean way
to represent namespace. (https://lkml.org/lkml/2015/9/22/1087) Few old
interfaces used pid only because at that time there was no better way to
represent namespaces.




Yeah, that was a reason.

If we think further, all syscalls that operate on non-child tasks are racy and
must be replaced with some kind of pidfd or taskfd.

Eric pointed that out too: https://lkml.org/lkml/2015/9/28/508




But we could merge both ways:

source >= 0 - pidns fd
source < 0  - task_pid = -source

But for what? I must have missed something...


I mean we could support both ways of pointing to a namespace in one argument.
Some classic syscalls employ similar magic for negative pids.

This is cheap and looks almost sane. =)
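
A rough sketch of how such a merged argument could be decoded, purely to make the convention concrete. Nothing here is from the patch: struct ns_ref, the enum and decode_ns_arg are hypothetical names, and the rule implemented is only the "non-negative is a namespace fd, negative is a negated task pid" idea floated above.

```c
#include <limits.h>
#include <sys/types.h>

/* Hypothetical decoded form of the single "source"/"target" argument. */
enum ns_ref_kind { NS_REF_INVALID, NS_REF_FD, NS_REF_TASK };

struct ns_ref {
	enum ns_ref_kind kind;
	int fd;			/* valid when kind == NS_REF_FD */
	pid_t task_pid;		/* valid when kind == NS_REF_TASK */
};

static struct ns_ref decode_ns_arg(int arg)
{
	struct ns_ref ref = { NS_REF_INVALID, -1, 0 };

	if (arg >= 0) {
		ref.kind = NS_REF_FD;
		ref.fd = arg;
	} else if (arg != INT_MIN) {	/* -INT_MIN would overflow */
		ref.kind = NS_REF_TASK;
		ref.task_pid = -(pid_t)arg;
	}
	return ref;
}
```

One consequence worth noting: under this rule a stray -1, for instance the return value of a failed open(), decodes as "the namespace of task 1" rather than as an error.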



Oleg.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-16 Thread Andy Lutomirski

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
 wrote:
>
>
> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:
>>
>>
>>
>> On 10/16/2017 02:36 PM, Andrew Morton wrote:
>>>
>>> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
>>>  wrote:
>>>
>>> pid_t translate_pid(pid_t pid, int source, int target);
>>>
>>> This syscall converts pid from source pid-ns into pid in target
>>> pid-ns.
>>> If pid is unreachable from target pid-ns it returns zero.
>>>
>>> Pid-namespaces are referred file descriptors opened to proc files
>>> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
>>> argument
>>> refers to current pid namespace, same as file /proc/self/ns/pid.
>>>
>>> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
>>> translation requires scanning all tasks. Also pids could be
>>> translated
>>> by sending them through unix socket between namespaces, this method
>>> is
>>> slow and insecure because other side is exposed inside pid namespace.

 Andrew asked why we might need this.

 Such conversion is required for interaction between processes across
 pid-namespaces.
 For example to identify process in container by pid file looking from
 outside.

 Two years ago I've solved this in project of mine with monstrous code
 which
 forks couple times just to convert pid, lucky for me performance wasn't
 important.
>>>
>>> That's a single user who needed this a single time, and found a
>>> userspace-based solution anyway.  This is not exactly compelling!
>>>
>>> Is there a stronger case to be made?  How does this change benefit our
>>> users?  Sell it to us!
>>
>> Oracle database is planning to use pid namespace for sandboxing database
>> instances and they need an API similar to translate_pid to effectively
>> translate process IDs from other pid namespaces. Prakash (cced in mail) can
>> provide more details on this usecase.
>
>
> As Nagarathnam indicated, Oracle Database will be using pid namespaces and
> needs a direct method of converting pids of processes in the pid namespace
> hierarchy. In this use case multiple
> nested PID namespaces will be used.  The currently available mechanisms are
> not very efficient for this use case. For example, as Konstantin described,
> using /proc/[pid]/status would require the application to scan all the pids'
> status files to determine the pid of a given process in a child namespace.
>
> Use of SCM_CREDENTIALS's socket message is another way, which would require
> every process starting inside a pid namespace to send this message and the
> receiving process in the target namespace would have to save the converted
> pid and reference it. This mechanism becomes cumbersome especially if the
> application has to deal with multiple nested pid namespaces. Also, the
> Database needs to be able to convert a thread's global pid(gettid()).
> Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
> CAP_SYS_ADMIN, which is an issue.
>
> So having a direct method, like the API that Konstantin is proposing, will
> work best for the Database
> since pid of a process in any of the nested pid namespaces can be converted
> as and when required. I think with the proposed API, the application should
> be able to convert pid of a process or tid(gettid()) of a thread as well.
>


Can you explain what Oracle's database is planning to do with this information?

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-16 Thread prakash.sangappa




On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:
On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov 
 wrote:



pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target 
pid-ns.

If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative 
argument

refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Also pids could be 
translated
by sending them through unix socket between namespaces, this 
method is
slow and insecure because other side is exposed inside pid 
namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across 
pid-namespaces.
For example to identify process in container by pid file looking 
from outside.


Two years ago I've solved this in project of mine with monstrous 
code which
forks couple times just to convert pid, lucky for me performance 
wasn't important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!
Oracle database is planning to use pid namespace for sandboxing 
database instances and they need an API similar to translate_pid to 
effectively translate process IDs from other pid namespaces. Prakash 
(cced in mail) can provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces 
and needs a direct method of converting pids of processes in the pid 
namespace hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanisms
are not very efficient for this use case. For example, as Konstantin
described, using /proc/[pid]/status would require the application to
scan all the pids' status files to determine the pid of a given process in
a child namespace.


Use of SCM_CREDENTIALS's socket message is another way, which would 
require every process starting inside a pid namespace to send this 
message and the receiving process in the target namespace would have to 
save the converted pid and reference it. This mechanism becomes 
cumbersome especially if the application has to deal with multiple 
nested pid namespaces. Also, the Database needs to be able to convert a 
thread's global pid(gettid()). Passing the thread's pid(gettid()) in 
SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.


So having a direct method, like the API that Konstantin is proposing, 
will work best for the Database
since pid of a process in any of the nested pid namespaces can be 
converted as and when required. I think with the proposed API, the 
application should be able to convert pid of a process or tid(gettid()) 
of a thread as well.


-Prakash


Thanks,
Nagarathnam.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-16 Thread Nagarathnam Muthusamy




On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov 
 wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Also pids could be translated
by sending them through unix socket between namespaces, this method is
slow and insecure because other side is exposed inside pid namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across 
pid-namespaces.
For example to identify process in container by pid file looking from outside.

Two years ago I've solved this in project of mine with monstrous code which
forks couple times just to convert pid, lucky for me performance wasn't 
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!
Oracle database is planning to use pid namespace for sandboxing database 
instances and they need an API similar to translate_pid to effectively 
translate process IDs from other pid namespaces. Prakash (cced in mail) 
can provide more details on this usecase.


Thanks,
Nagarathnam.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-16 Thread Andrew Morton

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov 
 wrote:

> >>> pid_t translate_pid(pid_t pid, int source, int target);
> >>>
> >>> This syscall converts pid from source pid-ns into pid in target pid-ns.
> >>> If pid is unreachable from target pid-ns it returns zero.
> >>>
> >>> Pid-namespaces are referred file descriptors opened to proc files
> >>> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative argument
> >>> refers to current pid namespace, same as file /proc/self/ns/pid.
> >>>
> >>> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
> >>> translation requires scanning all tasks. Also pids could be translated
> >>> by sending them through unix socket between namespaces, this method is
> >>> slow and insecure because other side is exposed inside pid namespace.
> 
> Andrew asked why we might need this.
> 
> Such conversion is required for interaction between processes across 
> pid-namespaces.
> For example to identify process in container by pid file looking from outside.
> 
> Two years ago I've solved this in project of mine with monstrous code which
> forks couple times just to convert pid, lucky for me performance wasn't 
> important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-16 Thread Nagarathnam Muthusamy




On 10/16/2017 09:24 AM, Oleg Nesterov wrote:

On 10/13, Konstantin Khlebnikov wrote:


On 13.10.2017 19:05, Oleg Nesterov wrote:

I won't insist, but this suggests we should add a new helper,
get_ns_by_fd_type(fd, type), and convert get_net_ns_by_fd() to use it
as well.

That was in v3.

I'll prefer to this later, separately. And replace fget with fdget which
allows to do this without atomic operations if task is single-threaded.

OK, agreed,


Stupid question. Can't we make a simpler API which doesn't need /proc/ ?
I mean,

sys_translate_pid(pid_t pid, pid_t source_pid, pid_t target_pid)
{
	struct pid_namespace *source_ns, *target_ns;

	source_ns = task_active_pid_ns(find_task_by_vpid(source_pid));
	target_ns = task_active_pid_ns(find_task_by_vpid(target_pid));

	...
}

Yes, this is more limited... Do you have a use-case when this is not enough?

That was in v1 but considered too racy.

Hmm, I don't understand...

Yes sure, this is racy but open("/proc/$pid/ns/pid") is racy too?

OK, once you do fd=open("/proc/$pid/ns/pid") you can use this fd even after
its owner exits, while find_task_by_vpid() will fail or find another task if
this pid was already reused.

But once again, do you have a use-case when this is important?


I believe that in V1 Eric pointed out that pid in general is not a clean way
to represent namespace. (https://lkml.org/lkml/2015/9/22/1087) Few old
interfaces used pid only because at that time there was no better way to
represent namespaces.





But we could merge both ways:

source >= 0 - pidns fd
source < 0  - task_pid = -source

But for what? I must have missed something...

Oleg.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-16 Thread Oleg Nesterov

On 10/13, Konstantin Khlebnikov wrote:
>
>
> On 13.10.2017 19:05, Oleg Nesterov wrote:
> >
> >I won't insist, but this suggests we should add a new helper,
> >get_ns_by_fd_type(fd, type), and convert get_net_ns_by_fd() to use it
> >as well.
>
> That was in v3.
>
> I'll prefer to this later, separately. And replace fget with fdget which
> allows to do this without atomic operations if task is single-threaded.

OK, agreed,

> >Stupid question. Can't we make a simpler API which doesn't need /proc/ ?
> >I mean,
> >
> > sys_translate_pid(pid_t pid, pid_t source_pid, pid_t target_pid)
> > {
> > struct pid_namespace *source_ns, *target_ns;
> >
> > source_ns = task_active_pid_ns(find_task_by_vpid(source_pid));
> > target_ns = task_active_pid_ns(find_task_by_vpid(target_pid));
> >
> > ...
> > }
> >Yes, this is more limited... Do you have a use-case when this is not
> >enough?
>
> That was in v1 but considered too racy.

Hmm, I don't understand...

Yes sure, this is racy but open("/proc/$pid/ns/pid") is racy too?

OK, once you do fd=open("/proc/$pid/ns/pid") you can use this fd even after
its owner exits, while find_task_by_vpid() will fail or find another task if
this pid was already reused.
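
The difference can be seen in a small userspace sketch, given here only as an illustration. The helper names pidns_path and open_pidns are made up; the /proc/[pid]/ns/pid path is the one discussed in this thread. The open() itself is still racy against pid reuse, but once it succeeds the descriptor keeps naming that namespace no matter what happens to the task.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: build "/proc/<pid>/ns/pid" into buf.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int pidns_path(pid_t pid, char *buf, size_t size)
{
	int n = snprintf(buf, size, "/proc/%ld/ns/pid", (long)pid);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Hypothetical helper: open the pid namespace of a task.
 * Returns a file descriptor, or -1 with errno set by open().
 */
static int open_pidns(pid_t pid)
{
	char path[64];

	if (pidns_path(pid, path, sizeof(path)) != 0)
		return -1;
	return open(path, O_RDONLY | O_CLOEXEC);
}
```

A caller would do the open once, early, and then pass the descriptor around instead of the pid.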

But once again, do you have a use-case when this is important?

> But we could merge both ways:
>
> source >= 0 - pidns fd
> source < 0  - task_pid = -source

But for what? I must have missed something...

Oleg.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-14 Thread Konstantin Khlebnikov

On 13.10.2017 19:13, Konstantin Khlebnikov wrote:

On 13.10.2017 19:05, Oleg Nesterov wrote:

On 10/13, Konstantin Khlebnikov wrote:

pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred to by file descriptors opened on proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument
refers to the current pid namespace, same as the file /proc/self/ns/pid.

The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Pids could also be translated
by sending them through a unix socket between namespaces, but this method is
slow and insecure because the other side is exposed inside the pid namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid namespaces, for example to identify a process in a container from the
outside by its pid file.

Two years ago I solved this in a project of mine with monstrous code which
forks a couple of times just to convert a pid; luckily for me, performance
wasn't important.

Examples:
translate_pid(pid, ns, -1)      - get pid in our pid namespace
translate_pid(pid, -1, ns)      - get pid in other pid namespace
translate_pid(1, ns, -1)        - get pid of init task for namespace
translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?
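
The last three examples can be read as one decision on the result of
translate_pid(1, ns1, ns2). A minimal sketch of that reading follows; the
enum and function names are hypothetical, and the sketch separates "equal"
from "strictly inside", which the "> 0" example above lumps together.

```c
#include <assert.h>

/* Relation of ns1 to ns2, read off translate_pid(1, ns1, ns2). */
enum pidns_relation {
	PIDNS_ERROR,	/* negative result: the call itself failed */
	PIDNS_OUTSIDE,	/* 0: init of ns1 is not reachable from ns2 */
	PIDNS_EQUAL,	/* 1: init of ns1 is pid 1 in ns2, same namespace */
	PIDNS_INSIDE,	/* > 1: ns1 is nested below ns2 */
};

enum pidns_relation classify_pidns_relation(long result)
{
	if (result < 0)
		return PIDNS_ERROR;
	if (result == 0)
		return PIDNS_OUTSIDE;
	if (result == 1)
		return PIDNS_EQUAL;
	return PIDNS_INSIDE;
}

/* Human-readable name, for logging. */
const char *pidns_relation_name(enum pidns_relation rel)
{
	static const char *const names[] = {
		"error", "outside", "equal", "inside",
	};

	assert((unsigned int)rel < sizeof(names) / sizeof(names[0]));
	return names[rel];
}
```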

Add Eugene, strace probably wants this too.

I have a vague feeling we have already discussed this in the past, but
I can't recall anything...

Yeah, v3 was two years ago.

+static struct pid_namespace *get_pid_ns_by_fd(int fd)
+{
+	struct pid_namespace *pidns;
+	struct ns_common *ns;
+	struct file *file;
+
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return ERR_CAST(file);
+
+	ns = get_proc_ns(file_inode(file));
+	if (ns->ops->type == CLONE_NEWPID)
+		pidns = get_pid_ns(to_pid_ns(ns));
+	else
+		pidns = ERR_PTR(-EINVAL);
+
+	fput(file);
+	return pidns;
+}

I won't insist, but this suggests we should add a new helper,
get_ns_by_fd_type(fd, type), and convert get_net_ns_by_fd() to use it
as well.

That was in v3.

I'd prefer to do this later, separately. And replace fget with fdget, which
allows doing this without atomic operations if the task is single-threaded.

+SYSCALL_DEFINE3(translate_pid, pid_t, pid, int, source, int, target)
+{
+	struct pid_namespace *source_ns, *target_ns;
+	struct pid *struct_pid;
+	pid_t result;
+
+	if (source >= 0) {
+		source_ns = get_pid_ns_by_fd(source);
+		result = PTR_ERR(source_ns);
+		if (IS_ERR(source_ns))
+			goto err_source;
+	} else
+		source_ns = task_active_pid_ns(current);
+
+	if (target >= 0) {
+		target_ns = get_pid_ns_by_fd(target);
+		result = PTR_ERR(target_ns);
+		if (IS_ERR(target_ns))
+			goto err_target;
+	} else
+		target_ns = task_active_pid_ns(current);
+
+	rcu_read_lock();
+	struct_pid = find_pid_ns(pid, source_ns);
+	result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
+	rcu_read_unlock();

Stupid question. Can't we make a simpler API which doesn't need /proc/ ?
I mean,

sys_translate_pid(pid_t pid, pid_t source_pid, pid_t target_pid)
{
	struct pid_namespace *source_ns, *target_ns;

	source_ns = task_active_pid_ns(find_task_by_vpid(source_pid));
	target_ns = task_active_pid_ns(find_task_by_vpid(target_pid));

	...
}

Yes, this is more limited... Do you have a use-case when this is not enough?

That was in v1 but considered too racy.

But we could merge both ways:

source >= 0 - pidns fs
source < 0  - task_pid = -source

Then "-1" points to init task in current pidns
which obviously lives in current pidns too,
thus lookup isn't required:

if (source >= 0)
	source_ns = get_pid_ns_by_fd(source);
else if (source == -1)
	source_ns = task_active_pid_ns(current);
else
	source_ns = task_active_pid_ns(find_task_by_vpid(-source));
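
The merged encoding proposed above can be modelled in userspace as a small
decoder. This is only an illustration of the argument convention: struct
ns_ref and decode_ns_arg() are hypothetical names, and nothing here is part
of the posted patch.

```c
#include <assert.h>
#include <limits.h>

enum ns_ref_kind {
	NS_REF_FD,	/* value is a pid-namespace file descriptor */
	NS_REF_CURRENT,	/* caller's own pid namespace, no lookup needed */
	NS_REF_TASK,	/* value is the pid of a task living in the namespace */
};

struct ns_ref {
	enum ns_ref_kind kind;
	int value;	/* fd for NS_REF_FD, pid for NS_REF_TASK, otherwise 0 */
};

/* Decode one source/target argument according to the merged convention. */
struct ns_ref decode_ns_arg(int arg)
{
	struct ns_ref ref = { NS_REF_CURRENT, 0 };

	assert(arg != INT_MIN);	/* negating INT_MIN would overflow */

	if (arg >= 0) {
		ref.kind = NS_REF_FD;
		ref.value = arg;
	} else if (arg < -1) {
		ref.kind = NS_REF_TASK;
		ref.value = -arg;
	}
	/* arg == -1: init of the current pidns, i.e. the current pidns itself */
	return ref;
}
```

One consequence of this convention is that a -1 left over from a failed
open() decodes as "current namespace" rather than as an error.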

v1: https://lkml.org/lkml/2015/9/15/411
v2: https://lkml.org/lkml/2015/9/24/278
v3: https://lkml.org/lkml/2015/9/28/3

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-13 Thread Konstantin Khlebnikov

On 13.10.2017 19:05, Oleg Nesterov wrote:

On 10/13, Konstantin Khlebnikov wrote:

pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts a pid from the source pid-ns into the corresponding pid
in the target pid-ns. If the pid is unreachable from the target pid-ns, it
returns zero.

Pid namespaces are referred to by file descriptors opened on the proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument
refers to the current pid namespace, same as the file /proc/self/ns/pid.

The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Pids could also be translated by
sending them through a unix socket between namespaces, but this method is
slow and insecure because the other side is exposed inside the pid namespace.
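
For comparison, the existing /proc route mentioned above can be sketched as
a parser for the NSpid line. This assumes the line format documented in
proc(5), "NSpid:" followed by whitespace-separated pids, outermost namespace
first; parse_nspid() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one "NSpid:" line from /proc/[pid]/status into out[].
 * Returns the number of pids stored (at most max), or -1 if the line
 * is not an NSpid line.
 */
int parse_nspid(const char *line, long *out, int max)
{
	static const char key[] = "NSpid:";
	const char *p;
	char *end;
	int n = 0;

	assert(line != NULL && out != NULL);

	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return -1;

	p = line + sizeof(key) - 1;
	while (n < max) {
		long v = strtol(p, &end, 10);

		if (end == p)	/* no further number on this line */
			break;
		out[n++] = v;
		p = end;
	}
	return n;
}
```

Going from an outer pid to the inner ones takes one read of one file.
Going the other way means running this over every /proc/[pid]/status until
an inner pid matches, which is the scan the description refers to.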

Examples:
translate_pid(pid, ns, -1)      - get pid in our pid namespace
translate_pid(pid, -1, ns)      - get pid in other pid namespace
translate_pid(1, ns, -1)        - get pid of init task for namespace
translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?

Add Eugene, strace probably wants this too.

I have a vague feeling we have already discussed this in the past, but
I can't recall anything...

Yeah, v3 was two years ago.

+static struct pid_namespace *get_pid_ns_by_fd(int fd)
+{
+	struct pid_namespace *pidns;
+	struct ns_common *ns;
+	struct file *file;
+
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return ERR_CAST(file);
+
+	ns = get_proc_ns(file_inode(file));
+	if (ns->ops->type == CLONE_NEWPID)
+		pidns = get_pid_ns(to_pid_ns(ns));
+	else
+		pidns = ERR_PTR(-EINVAL);
+
+	fput(file);
+	return pidns;
+}

I won't insist, but this suggests we should add a new helper,
get_ns_by_fd_type(fd, type), and convert get_net_ns_by_fd() to use it
as well.

That was in v3.

I'd prefer to do this later, separately. And replace fget with fdget, which
allows doing this without atomic operations if the task is single-threaded.

+SYSCALL_DEFINE3(translate_pid, pid_t, pid, int, source, int, target)
+{
+	struct pid_namespace *source_ns, *target_ns;
+	struct pid *struct_pid;
+	pid_t result;
+
+	if (source >= 0) {
+		source_ns = get_pid_ns_by_fd(source);
+		result = PTR_ERR(source_ns);
+		if (IS_ERR(source_ns))
+			goto err_source;
+	} else
+		source_ns = task_active_pid_ns(current);
+
+	if (target >= 0) {
+		target_ns = get_pid_ns_by_fd(target);
+		result = PTR_ERR(target_ns);
+		if (IS_ERR(target_ns))
+			goto err_target;
+	} else
+		target_ns = task_active_pid_ns(current);
+
+	rcu_read_lock();
+	struct_pid = find_pid_ns(pid, source_ns);
+	result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
+	rcu_read_unlock();

Stupid question. Can't we make a simpler API which doesn't need /proc/ ?
I mean,

sys_translate_pid(pid_t pid, pid_t source_pid, pid_t target_pid)
{
	struct pid_namespace *source_ns, *target_ns;

	source_ns = task_active_pid_ns(find_task_by_vpid(source_pid));
	target_ns = task_active_pid_ns(find_task_by_vpid(target_pid));

	...
}

Yes, this is more limited... Do you have a use-case when this is not enough?

That was in v1 but considered too racy.

v1: https://lkml.org/lkml/2015/9/15/411
v2: https://lkml.org/lkml/2015/9/24/278
v3: https://lkml.org/lkml/2015/9/28/3

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-13 Thread Oleg Nesterov

On 10/13, Konstantin Khlebnikov wrote:
>
> pid_t translate_pid(pid_t pid, int source, int target);
>
> This syscall converts a pid from the source pid-ns into the corresponding pid
> in the target pid-ns. If the pid is unreachable from the target pid-ns, it
> returns zero.
>
> Pid namespaces are referred to by file descriptors opened on the proc files
> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument
> refers to the current pid namespace, same as the file /proc/self/ns/pid.
>
> The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward
> translation requires scanning all tasks. Pids could also be translated by
> sending them through a unix socket between namespaces, but this method is
> slow and insecure because the other side is exposed inside the pid namespace.
>
> Examples:
> translate_pid(pid, ns, -1)      - get pid in our pid namespace
> translate_pid(pid, -1, ns)      - get pid in other pid namespace
> translate_pid(1, ns, -1)        - get pid of init task for namespace
> translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
> translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
> translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
> translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?

Add Eugene, strace probably wants this too.

I have a vague feeling we have already discussed this in the past, but
I can't recall anything...

> +static struct pid_namespace *get_pid_ns_by_fd(int fd)
> +{
> +	struct pid_namespace *pidns;
> +	struct ns_common *ns;
> +	struct file *file;
> +
> +	file = proc_ns_fget(fd);
> +	if (IS_ERR(file))
> +		return ERR_CAST(file);
> +
> +	ns = get_proc_ns(file_inode(file));
> +	if (ns->ops->type == CLONE_NEWPID)
> +		pidns = get_pid_ns(to_pid_ns(ns));
> +	else
> +		pidns = ERR_PTR(-EINVAL);
> +
> +	fput(file);
> +	return pidns;
> +}

I won't insist, but this suggests we should add a new helper,
get_ns_by_fd_type(fd, type), and convert get_net_ns_by_fd() to use it
as well.

> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, int, source, int, target)
> +{
> +	struct pid_namespace *source_ns, *target_ns;
> +	struct pid *struct_pid;
> +	pid_t result;
> +
> +	if (source >= 0) {
> +		source_ns = get_pid_ns_by_fd(source);
> +		result = PTR_ERR(source_ns);
> +		if (IS_ERR(source_ns))
> +			goto err_source;
> +	} else
> +		source_ns = task_active_pid_ns(current);
> +
> +	if (target >= 0) {
> +		target_ns = get_pid_ns_by_fd(target);
> +		result = PTR_ERR(target_ns);
> +		if (IS_ERR(target_ns))
> +			goto err_target;
> +	} else
> +		target_ns = task_active_pid_ns(current);
> +
> +	rcu_read_lock();
> +	struct_pid = find_pid_ns(pid, source_ns);
> +	result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
> +	rcu_read_unlock();
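
The lookup-then-translate step quoted above can be modelled in userspace to
show where the 0 and -ESRCH results come from. This is a toy sketch, not
kernel code: the toy_* types and functions are hypothetical stand-ins for
struct pid_namespace, struct pid, find_pid_ns() and pid_nr_ns(), and the
"table" replaces the kernel's pid lookup structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_LEVELS 4

struct toy_ns {
	int level;	/* 0 for the root namespace */
};

struct toy_pid {
	int level;				/* level of the task's own namespace */
	const struct toy_ns *ns[TOY_MAX_LEVELS];	/* namespace per level, root first */
	int nr[TOY_MAX_LEVELS];			/* pid number seen at each level */
};

/* Number of @pid as seen from @ns, or 0 if it is not visible there. */
int toy_pid_nr_ns(const struct toy_pid *pid, const struct toy_ns *ns)
{
	assert(ns != NULL && ns->level >= 0 && ns->level < TOY_MAX_LEVELS);

	if (pid && ns->level <= pid->level && pid->ns[ns->level] == ns)
		return pid->nr[ns->level];
	return 0;
}

/* Find the entry whose number in @ns is @nr; linear scan over a table. */
const struct toy_pid *toy_find_pid_ns(const struct toy_pid *tbl, size_t n,
				      int nr, const struct toy_ns *ns)
{
	size_t i;

	if (nr <= 0)
		return NULL;
	for (i = 0; i < n; i++)
		if (toy_pid_nr_ns(&tbl[i], ns) == nr)
			return &tbl[i];
	return NULL;
}

/* -ESRCH if @nr does not exist in @source, 0 if unreachable from @target. */
int toy_translate_pid(const struct toy_pid *tbl, size_t n, int nr,
		      const struct toy_ns *source, const struct toy_ns *target)
{
	const struct toy_pid *pid = toy_find_pid_ns(tbl, n, nr, source);

	return pid ? toy_pid_nr_ns(pid, target) : -ESRCH;
}
```

In this model a task that lives only in the parent namespace translates to 0
when the target is a child namespace, which matches the "unreachable" case
in the patch description.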

Stupid question. Can't we make a simpler API which doesn't need /proc/ ?
I mean,

sys_translate_pid(pid_t pid, pid_t source_pid, pid_t target_pid)
{
	struct pid_namespace *source_ns, *target_ns;

	source_ns = task_active_pid_ns(find_task_by_vpid(source_pid));
	target_ns = task_active_pid_ns(find_task_by_vpid(target_pid));

	...
}

Yes, this is more limited... Do you have a use-case when this is not enough?

Oleg.

Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-13 Thread Konstantin Khlebnikov


Sample tool in attachment

On 13.10.2017 12:26, Konstantin Khlebnikov wrote:

pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts a pid from the source pid-ns into the corresponding pid
in the target pid-ns. If the pid is unreachable from the target pid-ns, it
returns zero.

Pid namespaces are referred to by file descriptors opened on the proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument
refers to the current pid namespace, same as the file /proc/self/ns/pid.

The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Pids could also be translated by
sending them through a unix socket between namespaces, but this method is
slow and insecure because the other side is exposed inside the pid namespace.

Examples:
translate_pid(pid, ns, -1)      - get pid in our pid namespace
translate_pid(pid, -1, ns)      - get pid in other pid namespace
translate_pid(1, ns, -1)        - get pid of init task for namespace
translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?

Error codes:
EBADF    - file descriptor is closed
EINVAL   - file descriptor isn't pid-namespace
ESRCH    - task not found in @source namespace
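
A caller might map the error codes listed above to messages roughly as
follows. This is a sketch only: translate_pid_strerror() is a hypothetical
helper, and it assumes the usual libc convention that a failed syscall
returns -1 and leaves the code in errno.

```c
#include <assert.h>
#include <errno.h>

/* Describe an errno value using the meanings from the list above. */
const char *translate_pid_strerror(int err)
{
	assert(err > 0);	/* errno values are positive */

	switch (err) {
	case EBADF:
		return "file descriptor is closed";
	case EINVAL:
		return "file descriptor isn't a pid namespace";
	case ESRCH:
		return "task not found in source namespace";
	default:
		return "unexpected error";
	}
}
```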

Signed-off-by: Konstantin Khlebnikov 

---

v1: https://lkml.org/lkml/2015/9/15/411
v2: https://lkml.org/lkml/2015/9/24/278
  * use namespace-fd as second/third argument
  * add -pid for getting parent pid
  * move code into kernel/sys.c next to getppid
  * drop ifdef CONFIG_PID_NS
  * add generic syscall
v3: https://lkml.org/lkml/2015/9/28/3
  * use proc_ns_fdget()
  * update description
  * rebase to next-20150925
  * fix conflict with mlock2
v4:
  * rename into translate_pid()
  * remove syscall if CONFIG_PID_NS=n
  * drop -pid for parent task
  * drop fget-fdget optimizations
  * add helper get_pid_ns_by_fd()
  * wire only into x86
---
  arch/x86/entry/syscalls/syscall_32.tbl |1
  arch/x86/entry/syscalls/syscall_64.tbl |1
  include/linux/syscalls.h   |1
  kernel/pid_namespace.c |   66 
  kernel/sys_ni.c|3 +
  5 files changed, 72 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..257d839b3a91 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382	i386	pkey_free		sys_pkey_free
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
+385	i386	translate_pid		sys_translate_pid
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..1ebdab83c6f4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	common	translate_pid		sys_translate_pid
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a78186d826d7..6467ebc847c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -901,6 +901,7 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd,
  struct file_handle __user *handle,
  int flags);
  asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_translate_pid(pid_t pid, int source, int target);
  asmlinkage long sys_process_vm_readv(pid_t pid,
 const struct iovec __user *lvec,
 unsigned long liovcnt,
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 4918314893bc..062f35eedd41 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -13,6 +13,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -406,6 +407,71 @@ static void pidns_put(struct ns_common *ns)
put_pid_ns(to_pid_ns(ns));
  }
  
+static struct pid_namespace *get_pid_ns_by_fd(int fd)
+{
+	struct pid_namespace *pidns;
+	struct ns_common *ns;
+	struct file *file;
+
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return ERR_CAST(file);
+
+	ns = get_proc_ns(file_inode(file));
+	if (ns->ops->type == CLONE_NEWPID)
+		pidns = get_pid_ns(to_pid_ns(ns));
+	else
+		pidns = ERR_PTR(-EINVAL);
+
+	fput(file);
+	return pidns;
+}
+
+/*
+ * translate_pid - convert pid in source pid-ns into target pid-ns.
+ * @pid:pid for translation
+ *

[PATCH v4] pidns: introduce syscall translate_pid

2017-10-13 Thread Konstantin Khlebnikov

pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts a pid from the source pid-ns into the corresponding pid
in the target pid-ns. If the pid is unreachable from the target pid-ns, it
returns zero.

Pid namespaces are referred to by file descriptors opened on the proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument
refers to the current pid namespace, same as the file /proc/self/ns/pid.

The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Pids could also be translated by
sending them through a unix socket between namespaces, but this method is
slow and insecure because the other side is exposed inside the pid namespace.

Examples:
translate_pid(pid, ns, -1)      - get pid in our pid namespace
translate_pid(pid, -1, ns)      - get pid in other pid namespace
translate_pid(1, ns, -1)        - get pid of init task for namespace
translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?

Error codes:
EBADF    - file descriptor is closed
EINVAL   - file descriptor isn't pid-namespace
ESRCH    - task not found in @source namespace

Signed-off-by: Konstantin Khlebnikov 

---

v1: https://lkml.org/lkml/2015/9/15/411
v2: https://lkml.org/lkml/2015/9/24/278
 * use namespace-fd as second/third argument
 * add -pid for getting parent pid
 * move code into kernel/sys.c next to getppid
 * drop ifdef CONFIG_PID_NS
 * add generic syscall
v3: https://lkml.org/lkml/2015/9/28/3
 * use proc_ns_fdget()
 * update description
 * rebase to next-20150925
 * fix conflict with mlock2
v4:
 * rename into translate_pid()
 * remove syscall if CONFIG_PID_NS=n
 * drop -pid for parent task
 * drop fget-fdget optimizations
 * add helper get_pid_ns_by_fd()
 * wire only into x86
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 include/linux/syscalls.h               |    1 
 kernel/pid_namespace.c                 |   66 
 kernel/sys_ni.c                        |    3 +
 5 files changed, 72 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..257d839b3a91 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382    i386    pkey_free               sys_pkey_free
 383    i386    statx                   sys_statx
 384    i386    arch_prctl              sys_arch_prctl                  compat_sys_arch_prctl
+385    i386    translate_pid           sys_translate_pid
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..1ebdab83c6f4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330    common  pkey_alloc              sys_pkey_alloc
 331    common  pkey_free               sys_pkey_free
 332    common  statx                   sys_statx
+333    common  translate_pid           sys_translate_pid
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a78186d826d7..6467ebc847c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -901,6 +901,7 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd,
  struct file_handle __user *handle,
  int flags);
 asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_translate_pid(pid_t pid, int source, int target);
 asmlinkage long sys_process_vm_readv(pid_t pid,
 const struct iovec __user *lvec,
 unsigned long liovcnt,
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 4918314893bc..062f35eedd41 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -406,6 +407,71 @@ static void pidns_put(struct ns_common *ns)
         put_pid_ns(to_pid_ns(ns));
 }
 
+static struct pid_namespace *get_pid_ns_by_fd(int fd)
+{
+        struct pid_namespace *pidns;
+        struct ns_common *ns;
+        struct file *file;
+
+        file = proc_ns_fget(fd);
+        if (IS_ERR(file))
+                return ERR_CAST(file);
+
+        ns = get_proc_ns(file_inode(file));
+        if (ns->ops->type == CLONE_NEWPID)
+                pidns = get_pid_ns(to_pid_ns(ns));
+        else
+                pidns = ERR_PTR(-EINVAL);
+
+        fput(file);
+        return pidns;
+}
+
+/*
+ * translate_pid - convert pid in source pid-ns into target pid-ns.
+ * @pid:    pid for translation
+ * @source: pid-ns file descriptor or -1 for active namespace
+ * @target: pid-ns file descriptor or -1 for active namespace

