Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces
On 07/26/2016 04:54 AM, Andrew Vagin wrote:
On Mon, Jul 25, 2016 at 09:59:43AM -0500, Eric W. Biederman wrote:
"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:

[snip]

[snip]

So, from my point of view, the important piece that was missing from your commit message was the note to use readlink("/proc/self/fd/%d") on the returned FDs. I think that detail needs to be part of the commit message (and also the man page text). I think it would even be helpful to include the above program as part of the commit message: it helps people more quickly grasp the API.

Please, please make the standard way to compare these things fstat. That is much less magic than a symlink, and a little more future proof. Possibly even kcmp.

I like the idea of using kcmp to compare namespaces. I am going to add this functionality to kcmp and describe all of this in the man page.

Hi Andrey,

Can you briefly sketch out the proposed API and how it would be used? I'd find it useful to see that even before the implementation.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
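The fstat-based comparison suggested above could look roughly like the sketch below. It treats two descriptors as naming the same namespace when their device and inode numbers match. The function names same_namespace() and same_namespace_path() are hypothetical helpers made up for illustration; they are not part of the patch series, and the kcmp variant is not shown because its interface had not been proposed at this point in the thread.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: compare two open descriptors by identity.
 * Returns 1 if both refer to the same object (same st_dev and st_ino),
 * 0 if they differ, and -1 if fstat() fails on either descriptor.
 */
int same_namespace(int fd_a, int fd_b)
{
    struct stat a, b;

    if (fstat(fd_a, &a) < 0 || fstat(fd_b, &b) < 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/*
 * Hypothetical convenience wrapper: open two paths (for example
 * /proc/PID/ns/pid entries) and compare them with same_namespace().
 */
int same_namespace_path(const char *path_a, const char *path_b)
{
    int fd_a, fd_b, ret;

    fd_a = open(path_a, O_RDONLY);
    if (fd_a < 0)
        return -1;
    fd_b = open(path_b, O_RDONLY);
    if (fd_b < 0) {
        close(fd_a);
        return -1;
    }
    ret = same_namespace(fd_a, fd_b);
    close(fd_a);
    close(fd_b);
    return ret;
}
```

Comparing st_dev as well as st_ino is a defensive choice in this sketch: an inode number on its own is only unique within one filesystem.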
Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces
Hi Eric,

On 07/25/2016 03:18 PM, Eric W. Biederman wrote:
"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:
Hi Andrey,
On 07/22/2016 08:25 PM, Andrey Vagin wrote:
On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages) <mtk.manpa...@gmail.com> wrote:
Hi Andrey,
On 07/21/2016 11:06 PM, Andrew Vagin wrote:
On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages) wrote:
Hi Andrey,
On 07/14/2016 08:20 PM, Andrey Vagin wrote:

Could you add here an explanation of the API in detail: what do these FDs refer to, and how do you use them to solve the use case? And could you add that info to the commit messages please.

Hi Michael,

A patch for man-pages is attached. It adds the following text to namespaces(7).

Since Linux 4.X, the following ioctl(2) calls are supported for namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
    Returns a file descriptor that refers to an owning user namespace.

NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace. This ioctl(2) can be used for pid and user namespaces. For user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning.

For each of the above, I think it is worth mentioning that the close-on-exec flag is set for the returned file descriptor.

Hmm. That is an odd default.

Why do you say that? It's pretty common as the default for various APIs that create new FDs these days. (There's of course a strong argument that the original UNIX default was a design blunder...)

In addition to generic ioctl(2) errors, the following specific ones can occur:

EINVAL
    NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM
    The requested namespace is outside of the current namespace scope.

Perhaps add "and the caller does not have CAP_SYS_ADMIN in the initial user namespace"?

Having looked at that bit of code I don't think capabilities really have a role to play.

Yes, I caught up with that now. I'll wait to see how this plays out in the next patch version.

ENOENT
    ns_fd refers to the init namespace.

Thanks for this. But still part of the question remains unanswered. How do we (in user-space) use the file descriptors to answer any of the questions that this patch series was designed to solve? (This info should be in the commit message and the man-pages patch.)

I'm sorry, but I am not sure that I understand what you are asking. Here are the original questions:

Someone else then asked me a question that led me to wonder about generally introspecting on the parental relationships between user namespaces and the association of other namespaces types with user namespaces. One use would be visualization, in order to understand the running system. Another would be to answer the question I already mentioned: what capability does process X have to perform operations on a resource governed by namespace Y?

Here is an example which shows how we can get the owning namespace inode number by using these ioctl-s.

    $ ls -l /proc/13929/ns/pid
    lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'

    $ ./nsowner /proc/13929/ns/pid
    user:[4026532227]

The owning user namespace for pid:[4026532228] is user:[4026532227].

The nsowner tool is compiled from this code:

    int main(int argc, char *argv[])
    {
        char buf[128], path[] = "/proc/self/fd/0123456789";
        int ns, uns, ret;

        ns = open(argv[1], O_RDONLY);
        if (ns < 0)
            return 1;

        uns = ioctl(ns, NS_GET_USERNS);
        if (uns < 0)
            return 1;

        snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
        ret = readlink(path, buf, sizeof(buf) - 1);
        if (ret < 0)
            return 1;
        buf[ret] = 0;

        printf("%s\n", buf);
        return 0;
    }

So, from my point of view, the important piece that was missing from your commit message was the note to use readlink("/proc/self/fd/%d") on the returned FDs. I think that detail needs to be part of the commit message (and also the man page text). I think it would even be helpful to include the above program as part of the commit message: it helps people more quickly grasp the API.

Please, please make the standard way to compare these things fstat. That is much less magic than a symlink, and a little more future proof. Possibly even kcmp.

As in fstat() to get the st_ino field, right?

Cheers,

Michael

At some point we will care about migrating a migrating sub-container and we may have to have some minor changes.

Eric

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
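As a rough illustration of the fstat() approach raised above, the nsowner idea can be reworked so that the owning user namespace is identified by st_ino instead of by reading the /proc/self/fd symlink. This is only a sketch under stated assumptions: the function name owning_userns_ino() is hypothetical, and the fallback definition of NS_GET_USERNS assumes the request number found in <linux/nsfs.h> on kernels that ship this ioctl, which the thread itself does not specify.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Assumption: request number as found in <linux/nsfs.h> on kernels
 * that provide this ioctl; defined here so the sketch is self-contained.
 */
#ifndef NS_GET_USERNS
#define NSIO 0xb7
#define NS_GET_USERNS _IO(NSIO, 0x1)
#endif

/*
 * Hypothetical helper: store the inode number of the user namespace
 * that owns the namespace at ns_path. Returns 0 on success and -1 if
 * the path cannot be opened, the ioctl fails, or fstat() fails.
 */
int owning_userns_ino(const char *ns_path, ino_t *ino_out)
{
    struct stat st;
    int ns, uns, ret = -1;

    ns = open(ns_path, O_RDONLY);
    if (ns < 0)
        return -1;

    /* On success the ioctl returns a new descriptor for the owner. */
    uns = ioctl(ns, NS_GET_USERNS);
    if (uns >= 0) {
        if (fstat(uns, &st) == 0) {
            *ino_out = st.st_ino;
            ret = 0;
        }
        close(uns);
    }
    close(ns);
    return ret;
}
```

Called on /proc/13929/ns/pid from the example session, this would be expected to report the same number that appears inside the brackets of the user:[...] link, assuming the kernel supports the ioctl.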
Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces
Hi Andrey,

On 07/22/2016 08:25 PM, Andrey Vagin wrote:
On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages) <mtk.manpa...@gmail.com> wrote:
Hi Andrey,
On 07/21/2016 11:06 PM, Andrew Vagin wrote:
On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages) wrote:
Hi Andrey,
On 07/14/2016 08:20 PM, Andrey Vagin wrote:

Could you add here an explanation of the API in detail: what do these FDs refer to, and how do you use them to solve the use case? And could you add that info to the commit messages please.

Hi Michael,

A patch for man-pages is attached. It adds the following text to namespaces(7).

Since Linux 4.X, the following ioctl(2) calls are supported for namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
    Returns a file descriptor that refers to an owning user namespace.

NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace. This ioctl(2) can be used for pid and user namespaces. For user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning.

For each of the above, I think it is worth mentioning that the close-on-exec flag is set for the returned file descriptor.

In addition to generic ioctl(2) errors, the following specific ones can occur:

EINVAL
    NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM
    The requested namespace is outside of the current namespace scope.

Perhaps add "and the caller does not have CAP_SYS_ADMIN in the initial user namespace"?

ENOENT
    ns_fd refers to the init namespace.

Thanks for this. But still part of the question remains unanswered. How do we (in user-space) use the file descriptors to answer any of the questions that this patch series was designed to solve? (This info should be in the commit message and the man-pages patch.)

I'm sorry, but I am not sure that I understand what you are asking. Here are the original questions:

Someone else then asked me a question that led me to wonder about generally introspecting on the parental relationships between user namespaces and the association of other namespaces types with user namespaces. One use would be visualization, in order to understand the running system. Another would be to answer the question I already mentioned: what capability does process X have to perform operations on a resource governed by namespace Y?

Here is an example which shows how we can get the owning namespace inode number by using these ioctl-s.

    $ ls -l /proc/13929/ns/pid
    lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'

    $ ./nsowner /proc/13929/ns/pid
    user:[4026532227]

The owning user namespace for pid:[4026532228] is user:[4026532227].

The nsowner tool is compiled from this code:

    int main(int argc, char *argv[])
    {
        char buf[128], path[] = "/proc/self/fd/0123456789";
        int ns, uns, ret;

        ns = open(argv[1], O_RDONLY);
        if (ns < 0)
            return 1;

        uns = ioctl(ns, NS_GET_USERNS);
        if (uns < 0)
            return 1;

        snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
        ret = readlink(path, buf, sizeof(buf) - 1);
        if (ret < 0)
            return 1;
        buf[ret] = 0;

        printf("%s\n", buf);
        return 0;
    }

So, from my point of view, the important piece that was missing from your commit message was the note to use readlink("/proc/self/fd/%d") on the returned FDs. I think that detail needs to be part of the commit message (and also the man page text). I think it would even be helpful to include the above program as part of the commit message: it helps people more quickly grasp the API.

Does this example answer the original question?

Yes.

If it doesn't, could you elaborate on what you expect to see here?

And I wrote one more example which shows all relationships between namespaces. It enumerates all processes in a system, collects all namespaces and determines parent and owning namespaces for each of them, then it constructs a namespace tree and shows it.

Here is the code: https://gist.github.com/avagin/db805f95e15ffb0af7e559dbb8de4418

That's great! Thanks!

Here is an example of output for my test system:

[root@fc24 nsfs]# ./nstree
user:[4026531837] \__ mnt:[4026532203] \__ ipc:[4026531839] \__ user:[4026532224] \__ user:[4026532226] \__ user:[4026532227] \__ pid:[4026532228] \__ pid:[4026532225] \__ pid:[4026532228] \__ user:[4026532221] \__ pid:[402653] \__ user:[4026532223] \__ mnt:[4026532211] \__ uts:[4026531838] \__ cgroup:[4026531835] \__ pid:[4026531836] \__ pid:[4026532225] \__ pid:[4026532228] \__ pid:[402653] \__ mnt:[4026531857] \__ mnt:[4026531840] \__ net:[4026531957]

Cheers,

Michael

[1]
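A tool that builds a namespace tree like this has to follow parent links repeatedly. The sketch below shows only that one step, walking upward with NS_GET_PARENT until the ioctl fails, and is not taken from the linked gist. The function name count_ancestors() and the MAX_DEPTH cap are hypothetical, and the fallback definition of NS_GET_PARENT assumes the request number found in <linux/nsfs.h> on kernels that ship this ioctl.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Assumption: request number as found in <linux/nsfs.h> on kernels
 * that provide this ioctl; defined here so the sketch is self-contained.
 */
#ifndef NS_GET_PARENT
#define NSIO 0xb7
#define NS_GET_PARENT _IO(NSIO, 0x2)
#endif

/* Hypothetical safety cap so the loop always terminates. */
#define MAX_DEPTH 64

/*
 * Hypothetical helper: count how many parent namespaces can be reached
 * from the namespace at ns_path. Returns -1 if the path cannot be
 * opened, otherwise the number of successful NS_GET_PARENT steps.
 */
int count_ancestors(const char *ns_path)
{
    int fd, parent, depth = 0;

    fd = open(ns_path, O_RDONLY);
    if (fd < 0)
        return -1;

    while (depth < MAX_DEPTH) {
        parent = ioctl(fd, NS_GET_PARENT);
        if (parent < 0)
            break;          /* any error ends the walk */
        close(fd);          /* done with the child descriptor */
        fd = parent;
        depth++;
    }
    close(fd);
    return depth;
}
```

The walk stops on any error, so it does not distinguish a nonhierarchical namespace (EINVAL in the draft man page text) from a parent that is outside the caller's scope (EPERM). Only ancestors visible to the caller are counted.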
Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces
Hi Andrey,

On 07/14/2016 08:20 PM, Andrey Vagin wrote:

Each namespace has an owning user namespace and now there is no way to discover these relationships.

Pid and user namespaces are hierarchical. There is no way to discover parent-child relationships either.

Why might we want to know the relationships between namespaces?

One use would be visualization, in order to understand the running system. Another would be to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y?

One more use-case (which is usually called abnormal) is checkpoint/restart. In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determine relationships between namespaces.

Eric suggested adding two ioctl-s [2]:

Grumble, Grumble. I think this may actually be a case for creating ioctls for these two cases. Now that random nsfs file descriptors are bind mountable the original reason for using proc files is not as pressing.

One ioctl for the user namespace that owns a file descriptor.
One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementation of these ioctl-s.

Could you add here an explanation of the API in detail: what do these FDs refer to, and how do you use them to solve the use case? And could you add that info to the commit messages please.

Thanks,

Michael

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Cc: "Eric W. Biederman" <ebied...@xmission.com>
Cc: James Bottomley <james.bottom...@hansenpartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com>
Cc: "W. Trevor King" <wk...@tremily.us>
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hal...@canonical.com>

--
2.5.5

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
man-pages-4.07 is released
Gidday,

The Linux man-pages maintainer proudly announces:

    man-pages-4.07 - man pages for Linux

This release includes input and contributions from around 50 people. Over 140 pages saw changes, ranging from typo fixes through to page rewrites and 4 newly created pages.

Tarball download:
    http://www.kernel.org/doc/man-pages/download.html
Git repository:
    https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
    http://man7.org/linux/man-pages/changelog.html#release_4.07

A short summary of the release is blogged at:
http://linux-man-pages.blogspot.com/2016/07/man-pages-407-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest to readers on LKML is shown below.

Cheers,

Michael

Changes in man-pages-4.07

Released: 2016-07-17, Ulm

New and rewritten pages
-----------------------

ioctl_fideduperange.2
    Darrick J. Wong [Christoph Hellwig, Michael Kerrisk]
        New page documenting the FIDEDUPERANGE ioctl
            Document the FIDEDUPERANGE ioctl, formerly known as
            BTRFS_IOC_EXTENT_SAME.

ioctl_ficlonerange.2
    Darrick J. Wong [Christoph Hellwig, Michael Kerrisk]
        New page documenting FICLONE and FICLONERANGE ioctls
            Document the FICLONE and FICLONERANGE ioctls, formerly
            known as the BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE
            ioctls.

mount_namespaces.7
    Michael Kerrisk [Michael Kerrisk]
        New page describing mount namespaces

Newly documented interfaces in existing pages
---------------------------------------------

mount.2
    Michael Kerrisk
        Document flags used to set propagation type
            Document MS_SHARED, MS_PRIVATE, MS_SLAVE, and
            MS_UNBINDABLE.
    Michael Kerrisk
        Document the MS_REC flag

ptrace.2
    Michael Kerrisk [Kees Cook, Jann Horn, Eric W. Biederman, Stephen Smalley]
        Document ptrace access modes

proc.5
    Michael Kerrisk
        Document /proc/[pid]/timerslack_ns
    Michael Kerrisk
        Document /proc/PID/status 'Ngid' field
    Michael Kerrisk
        Document /proc/PID/status fields: 'NStgid', 'NSpid', 'NSpgid', 'NSsid'
    Michael Kerrisk
        Document /proc/PID/status 'Umask' field

Changes to individual pages
---------------------------

ldd.1
    Michael Kerrisk
        Add a little more detail on why ldd is unsafe with untrusted executables

futex.2
    Michael Kerrisk
        Correct an ENOSYS error description
            Since Linux 4.5, FUTEX_CLOCK_REALTIME is allowed with
            FUTEX_WAIT.
    Michael Kerrisk [Darren Hart]
        Remove crufty text about FUTEX_WAIT_BITSET interpretation of timeout
            Since Linux 4.5, FUTEX_WAIT also understands
            FUTEX_CLOCK_REALTIME.
    Michael Kerrisk [Thomas Gleixner]
        Explain how to get equivalent of FUTEX_WAIT with an absolute timeout
    Michael Kerrisk
        Describe FUTEX_BITSET_MATCH_ANY
            Describe FUTEX_BITSET_MATCH_ANY and FUTEX_WAIT and
            FUTEX_WAKE equivalences.
    Michael Kerrisk [Thomas Gleixner, Darren Hart]
        Fix descriptions of various timeouts
    Michael Kerrisk
        Clarify clock default and choices for FUTEX_WAIT

kcmp.2
    Michael Kerrisk
        kcmp() is governed by PTRACE_MODE_READ_REALCREDS

mount.2
    Michael Kerrisk
        Restructure discussion of 'mountflags' into functional groups
            The existing text makes no differentiation between
            different "classes" of mount flags. However, certain
            flags such as MS_REMOUNT, MS_BIND, MS_MOVE, etc.
            determine the general type of operation that mount()
            performs. Furthermore, the choice of which class of
            operation to perform is performed in a certain order,
            and that order is significant if multiple flags are
            specified. Restructure and extend the text to reflect
            these details.
    Michael Kerrisk
        Since Linux 2.6.26, bind mounts can be made read-only

process_vm_readv.2
    Michael Kerrisk
        Rephrase permission rules in terms of a ptrace access mode check

ptrace.2
    Michael Kerrisk [Jann Horn]
        Update Yama ptrace_scope documentation
            Reframe the discussion in terms of PTRACE_MODE_ATTACH
            checks, and make a few other minor tweaks and additions.
    Michael Kerrisk, Jann Horn
        Note that user namespaces can be used to bypass Yama protections
    Michael Kerrisk
        Note that PTRACE_SEIZE is subject to a ptrace access mode check
    Michael Kerrisk
        Rephrase PTRACE_ATTACH permissions in terms of ptrace access mode check

wait.2
    Michael Kerrisk
        Since Linux 4.7, __WALL is implied if child being ptraced
    Michael Kerrisk
        waitid() now (since Linux 4.7) also supports __WNOTHREAD/__WCLONE/__WALL

proc.5
    Michael Kerrisk
        /proc/PID/fd/* ar
Re: Bugzilla spam
Hello Konstantin,

On 13 July 2016 at 20:37, Konstantin Ryabitsev <mri...@kernel.org> wrote:
> On Wed, Jul 13, 2016 at 08:28:18PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Konstantin,
>>
>> The man-pages Bugzilla component (as well as other components on
>> Bugzilla by the look of things) is receiving vast quantities of spam.
>> What can be done about this? (Just marking the bugs private and
>> closing isn't workable. There's just too many bugs coming in...).
>
> Not much can be done. :( Bugzilla's default spam-fighting capabilities
> are abysmal -- I can't even delete any accounts without installing
> multiple extensions. I'm actively investigating what we can do to
> improve the situation and will follow up shortly.

Okay, thanks. In the meantime, is it possible for you to lock the
man-pages component so that no further bug reports can be made via
that component?

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Bugzilla spam
Hello Konstantin,

The man-pages Bugzilla component (as well as other components on
Bugzilla by the look of things) is receiving vast quantities of spam.
What can be done about this? (Just marking the bugs private and
closing isn't workable. There's just too many bugs coming in...).

Thanks

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: [CRIU] Introspecting userns relationships to other namespaces?
On 07/08/2016 05:26 AM, James Bottomley wrote:
On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:
On 7 July 2016 at 17:01, James Bottomley <james.bottom...@hansenpartnership.com> wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm. Probably best-effort based on the process hierarchy. So yeah
you could probably get a tree into a state that would be wrongly
recreated. Create a new netns, bind mount it, exit; Have another
task create a new user_ns, bind mount it, exit; Third task setns()s
first to the new netns then to the new user_ns. I suspect criu will
recreate that wrongly.

This is a bit pathological, and you have to be root to do it: so root
can set up a nesting hierarchy, bind it and destroy the pids but I know
of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I have
a permanent handle to enter the namespace by, so I suspect that when
our current orchestration systems get more sophisticated, they might
eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the problem
is that although each namespace has a parent user_ns, there's no way to
get it without digging in the namespace specific structure. Probably we
should restructure to move it into ns_common, then we could display it
(and enforce all namespaces having owning user_ns) but it would be a

I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?

Um, yes, I don't believe I said they don't. The problem I thought you
were having is that there's no way of seeing what it is.

nsfs is the Namespace filesystem where bound namespaces appear to a cat
of /proc/self/mounts. It can display any information that's in
ns_common (the common core of namespaces) but the owning user_ns
pointer currently isn't in this structure. Every user namespace has a
pointer to it, but they're all privately embedded in the individual
namespace specific structures. What I was proposing was that since
every current namespace has a pointer somewhere to the owning user
namespace, we could abstract this out into ns_common so it's now
accessible to be displayed by nsfs, probably as a mount option.

James, I am not sure that I understood you correctly. We have one file
system for all namespace files, so how can we show per-file properties
in mount options?

I think we can show all required information in fdinfo. We open a
namespaces file (/proc/pid/ns/N) and then read /proc/pid/fdinfo/X for
it.

Here is a proof-of-concept patch.

How it works:

    In [1]: import os
    In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
    In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
    pos:0
    flags: 010
    mnt_id: 2
    userns: 4026531837
    In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
    /proc/self/ns/user -> user:[4026531837]

can't you just do

    readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'

? But what Michael was asking about was the parent user_ns of all the
other namespaces ...

Just to reiterate, what I'm interested in is the introspection use
case (but there's clearly several other interesting use cases here).
The idea is to be able to answer these questions:

1. For each userns, what is the parent of that userns?
2. For each non-user namespace, what is the owning userns?

This enables us to understand the userns hierarchy, which matters in
terms of answering the question: what capabilities does process X have
in namespace Y?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
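As a rough illustration of the readlink approach mentioned above, the following Python sketch collects the namespace identifiers visible under /proc/PID/ns. It assumes the `type:[inode]` link format shown in the session above; the helper names `parse_ns_link` and `namespace_ids` are made up for this example. Note that this only identifies the namespaces themselves. It does not answer either of the two questions about parent or owning user namespaces, which is exactly the gap under discussion.

```python
import os
import re

# Matches link targets such as "user:[4026531837]".
_NS_LINK = re.compile(r"^(?P<type>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/PID/ns/* link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % target)
    return match.group("type"), int(match.group("inode"))


def namespace_ids(pid="self"):
    """Return {entry name: inode} for the entries under /proc/PID/ns."""
    ns_dir = "/proc/%s/ns" % pid
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # entry vanished, unreadable, or in an unknown format
        ids[name] = inode
    return ids


if __name__ == "__main__":
    for name, inode in namespace_ids().items():
        print("%-20s %d" % (name, inode))
```

Two processes are in the same namespace of a given type when the corresponding inode numbers match.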
Re: Introspecting userns relationships to other namespaces?
On 07/07/2016 09:17 PM, James Bottomley wrote:
On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:
On 7 July 2016 at 17:01, James Bottomley <james.bottom...@hansenpartnership.com> wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm. Probably best-effort based on the process hierarchy. So yeah
you could probably get a tree into a state that would be wrongly
recreated. Create a new netns, bind mount it, exit; Have another
task create a new user_ns, bind mount it, exit; Third task setns()s
first to the new netns then to the new user_ns. I suspect criu will
recreate that wrongly.

This is a bit pathological, and you have to be root to do it: so root
can set up a nesting hierarchy, bind it and destroy the pids but I know
of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I have
a permanent handle to enter the namespace by, so I suspect that when
our current orchestration systems get more sophisticated, they might
eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the problem
is that although each namespace has a parent user_ns, there's no way to
get it without digging in the namespace specific structure. Probably we
should restructure to move it into ns_common, then we could display it
(and enforce all namespaces having owning user_ns) but it would be a

I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?

Um, yes, I don't believe I said they don't. The problem I thought you
were having is that there's no way of seeing what it is.

Your words "and enforce all namespaces having owning user_ns" were
what left me puzzled--it sounded to me that the implication was that
this is not "enforced" right now.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: Introspecting userns relationships to other namespaces?
On 7 July 2016 at 17:01, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
>> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
>> > Hi Serge,
>> >
>> > On 6 July 2016 at 16:13, Serge E. Hallyn <se...@hallyn.com> wrote:
>> > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk
>> > > (man-pages) wrote:
>> > > > [Rats! Doing now what I should have done to start with. Looping
>> > > > some lists and CRIU and other possibly relevant people into
>> > > > this conversation]
>> > > >
>> > > > Hi Eric,
>> > > >
>> > > > On 5 July 2016 at 23:47, Eric W. Biederman
>> > > > <ebied...@xmission.com> wrote:
>> > > > > "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com>
>> > > > > writes:
>> > > > >
>> > > > > > Hi Eric,
>> > > > > >
>> > > > > > I have a question. Is there any way currently to discover
>> > > > > > which user namespace a particular nonuser namespace is
>> > > > > > governed by? Maybe I am missing something, but there does
>> > > > > > not seem to be a way to do this. Also, can one discover
>> > > > > > which userns is the parent of a given userns? Again, I
>> > > > > > can't see a way to do this.
>> > > > > >
>> > > > > > The point here is introspecting so that a process might
>> > > > > > determine what its capabilities are when operating on some
>> > > > > > resource governed by a (nonuser) namespace.
>> > > > >
>> > > > > To the best of my knowledge there is not an interface to
>> > > > > get that information. It would be good to have such an
>> > > > > interface for no other reason than the CRIU folks are going
>> > > > > to need it at some point. I am a bit surprised they have not
>> > > > > complained yet.
>> > >
>> > > I don't think they need it. They do in fact have what they need.
>> > > Assume you have tasks T1, T2, T1_1 and T2_1; T1 and T2 are in
>> > > init_user_ns; T1 spawned T1_1 in a new userns; T2 spawned T2_1
>> > > which setns()d to T1_1's ns. There's some {handwave} uid mapping,
>> > > does not matter.
>> > >
>> > > At restart, it doesn't matter which task originally created the
>> > > new userns. criu knows T1_1 and T2_1 are in the same userns; it
>> > > creates the userns, sets up the mapping, and T1_1 and T2_1
>> > > setns() to it.
>> >
>> > I'm missing something here. How do the parental relationships
>> > between the user namespaces get reconstructed? Those relationships
>> > will govern what capabilities a process will have in various user
>> > namespaces.
>
> Actually, you get the parent namespace from the process tree by
> tracking the user namespaces of the parent pids. Currently non-root
> users can't bind the namespace, so the only way to keep a new user_ns
> around if you're not root is to keep the process around, so for
> multiply nested user namespaces you can usually build the user_ns
> hierarchy by looking at the process hierarchy. Conversely, if the
> process is reparented to init, chances are that the user_ns is also
> parented to init_user_ns.

Yes, but "chances are" == this isn't robust. PR_SET_CHILD_SUBREAPER
further complicates things.

By the way, is that really what happens? Do child user namespaces get
reparented to the grandparent ns if the parent ns disappears (i.e.,
ceases to have any members and no bind mounts)? I hadn't thought about
that scenario before. It may be worth documenting in
user_namespaces(7).

>> Hm. Probably best-effort based on the process hierarchy. So yeah
>> you could probably get a tree into a state that would be wrongly
>> recreated. Create a new netns, bind mount it, exit; Have another
>> task create a new user_ns, bind mount it, exit; Third task setns()s
>> first to the new netns then to the new user_ns. I suspect criu will
>> recreate that wrongly.
>
> This is a bit pathological, and you have to be root to do it: so root
> can set up a nesting hierarchy, bind it and destroy the pids but I know
> of no current orchestration system which does this.
>
> Actually, I have to back pedal a bit: the way I currently set up
> architec
Re: Introspecting userns relationships to other namespaces?
Hi Serge,

On 6 July 2016 at 16:13, Serge E. Hallyn <se...@hallyn.com> wrote:
> On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
>> [Rats! Doing now what I should have done to start with. Looping some
>> lists and CRIU and other possibly relevant people into this
>> conversation]
>>
>> Hi Eric,
>>
>> On 5 July 2016 at 23:47, Eric W. Biederman <ebied...@xmission.com> wrote:
>> > "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:
>> >
>> >> Hi Eric,
>> >>
>> >> I have a question. Is there any way currently to discover which
>> >> user namespace a particular nonuser namespace is governed by?
>> >> Maybe I am missing something, but there does not seem to be a
>> >> way to do this. Also, can one discover which userns is the
>> >> parent of a given userns? Again, I can't see a way to do this.
>> >>
>> >> The point here is introspecting so that a process might determine
>> >> what its capabilities are when operating on some resource governed
>> >> by a (nonuser) namespace.
>> >
>> > To the best of my knowledge there is not an interface to get that
>> > information. It would be good to have such an interface for no other
>> > reason than the CRIU folks are going to need it at some point. I am a
>> > bit surprised they have not complained yet.
>
> I don't think they need it. They do in fact have what they need. Assume
> you have tasks T1, T2, T1_1 and T2_1; T1 and T2 are in init_user_ns; T1
> spawned T1_1 in a new userns; T2 spawned T2_1 which setns()d to T1_1's ns.
> There's some {handwave} uid mapping, does not matter.
>
> At restart, it doesn't matter which task originally created the new userns.
> criu knows T1_1 and T2_1 are in the same userns; it creates the userns, sets
> up the mapping, and T1_1 and T2_1 setns() to it.

I'm missing something here. How do the parental relationships between
the user namespaces get reconstructed? Those relationships will govern
what capabilities a process will have in various user namespaces.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
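To make the question concrete, here is a small Python sketch of the best-effort approach raised elsewhere in this thread: guessing the parent of each user namespace from the process hierarchy. It is not CRIU's actual logic. The function name `guess_userns_parents`, the input format (pid mapped to a (ppid, userns id) pair), and the namespace ids in the example are all assumptions made for illustration.

```python
from collections import Counter


def guess_userns_parents(tasks):
    """Guess the parent of each user namespace from the process tree.

    `tasks` maps pid -> (ppid, userns_id). For every task whose parent
    process lives in a different user namespace, the parent's namespace
    is counted as a candidate parent for the task's namespace.
    Returns {userns_id: most common candidate, or None if no evidence}.
    """
    votes = {}
    for pid, (ppid, userns) in tasks.items():
        votes.setdefault(userns, Counter())
        parent = tasks.get(ppid)
        if parent is not None and parent[1] != userns:
            votes[userns][parent[1]] += 1
    return {
        userns: (counter.most_common(1)[0][0] if counter else None)
        for userns, counter in votes.items()
    }


if __name__ == "__main__":
    INIT_NS, NEW_NS = 4026531837, 4026532001  # hypothetical ids
    tasks = {
        1: (0, INIT_NS),    # init
        10: (1, INIT_NS),   # T1
        20: (1, INIT_NS),   # T2
        11: (10, NEW_NS),   # T1_1, spawned by T1 in a new userns
        21: (20, NEW_NS),   # T2_1, spawned by T2, setns()d to T1_1's ns
    }
    print(guess_userns_parents(tasks))
```

For the T1/T2 scenario this guesses that the new userns is a child of the initial one. The guess can be wrong: if the only member of a nested userns has been reparented to init, the sketch attributes that userns to init's namespace regardless of where it was really created, which is why the thread asks for a real kernel interface.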
Re: Introspecting userns relationships to other namespaces?
[Rats! Doing now what I should have done to start with. Looping some
lists and CRIU and other possibly relevant people into this
conversation]

Hi Eric,

On 5 July 2016 at 23:47, Eric W. Biederman <ebied...@xmission.com> wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:
>
>> Hi Eric,
>>
>> I have a question. Is there any way currently to discover which
>> user namespace a particular nonuser namespace is governed by?
>> Maybe I am missing something, but there does not seem to be a
>> way to do this. Also, can one discover which userns is the
>> parent of a given userns? Again, I can't see a way to do this.
>>
>> The point here is introspecting so that a process might determine
>> what its capabilities are when operating on some resource governed
>> by a (nonuser) namespace.
>
> To the best of my knowledge there is not an interface to get that
> information. It would be good to have such an interface for no other
> reason than the CRIU folks are going to need it at some point. I am a
> bit surprised they have not complained yet.
>
> That said in a normal use scenario I don't think that information is
> needed.
>
> Do you have a particular use case besides checkpoint/restart where this
> is useful? That might help in coming up with a good userspace interface
> for this information.

So, I spend a moderate amount of time working with people to introduce
them to the namespaces infrastructure, and one topic that comes up now
and then is introspection/visualization tools. For example,
nowadays--thanks to the (bizarrely misnamed) NStgid and NSpid fields in
/proc/PID/status--it's possible to (and someone I was working with did)
write tools that introspect the PID namespace hierarchy to show all of
the processes and their PIDs in the various namespace instances. It's a
natural enough thing to want to do, when confronted with the complexity
of the namespaces.

Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespace types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations on
a resource governed by namespace Y?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
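As a sketch of the kind of PID namespace introspection tool mentioned above, the following Python snippet extracts the NStgid, NSpid, NSpgid and NSsid fields from /proc/PID/status. The helper names `parse_ns_ids` and `read_ns_ids` are invented for this example, and the field layout (whitespace-separated integers after the field name) is assumed from proc(5), where the leftmost value is the ID in the PID namespace of the procfs mount and later values are for successively nested namespaces.

```python
NS_FIELDS = ("NStgid", "NSpid", "NSpgid", "NSsid")


def parse_ns_ids(status_text):
    """Extract the NS* ID lists from the text of a /proc/PID/status file.

    Returns a dict such as {"NSpid": [1234, 1], ...}; fields missing
    from the input (for example on older kernels) are simply absent.
    """
    result = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in NS_FIELDS:
            result[key] = [int(field) for field in value.split()]
    return result


def read_ns_ids(pid="self"):
    """Read and parse the NS* fields for one process (Linux only)."""
    with open("/proc/%s/status" % pid) as status:
        return parse_ns_ids(status.read())


if __name__ == "__main__":
    print(read_ns_ids())
```

The length of the NSpid list gives the nesting depth of the process's PID namespace relative to the procfs mount, which is enough to draw the PID namespace hierarchy. No comparable field exists here for user namespace relationships, which is the point of the question.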
Re: Review of ptrace Yama ptrace_scope description
Hi Kees,

On 06/28/2016 10:55 PM, Kees Cook wrote:
On Mon, Jun 27, 2016 at 11:11 PM, Michael Kerrisk (man-pages) <mtk.manpa...@gmail.com> wrote:
Hi Jann,
On 06/25/2016 04:30 PM, Jann Horn wrote:
On Sat, Jun 25, 2016 at 09:30:43AM +0200, Michael Kerrisk (man-pages) wrote:
Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe the Yama ptrace_scope file. I don't think I asked you for review at the time, but in the light of other changes to the ptrace(2) page, it occurred to me that it might be a good idea to ask you to check the text below to see if anything is missing or could be improved. Might you have a moment for that?

/proc/sys/kernel/yama/ptrace_scope
    On systems with the Yama Linux Security Module (LSM) installed (i.e., the kernel was configured with CONFIG_SECURITY_YAMA), the /proc/sys/kernel/yama/ptrace_scope file (available since Linux 3.4) can be used to restrict the ability to trace a process with ptrace(2) (and thus also the ability to use tools such as strace(1) and gdb(1)). The goal of such restrictions is to prevent attack escalation whereby a compromised process can ptrace-attach to other sensitive processes (e.g., a GPG agent or an SSH session) owned by the user in order to gain additional credentials and thus expand the scope of the attack.

Maybe clarify "additional credentials that may exist in memory only and thus..."

Done.

    More precisely, the Yama LSM limits two types of operations:

    *  Any operation that performs a ptrace access mode PTRACE_MODE_ATTACH check (for example, ptrace() PTRACE_ATTACH). (See the "Ptrace access mode checking" discussion above.)

    *  ptrace() PTRACE_TRACEME.

    A process that has the CAP_SYS_PTRACE capability can update the /proc/sys/kernel/yama/ptrace_scope file with one of the following values:

    0 ("classic ptrace permissions")
        No additional restrictions on operations that perform PTRACE_MODE_ATTACH checks (beyond those imposed by the commoncap and other LSMs). The use of PTRACE_TRACEME is unchanged.

    1 ("restricted ptrace") [default value]
        When performing an operation that requires a PTRACE_MODE_ATTACH check, the calling process must have a predefined relationship with the target process. By default, the predefined relationship is that the target process must be a child of the caller. A target process can employ the prctl(2) PR_SET_PTRACER operation to declare a different PID that is allowed to perform PTRACE_MODE_ATTACH operations on the target. See the kernel source file Documentation/security/Yama.txt for further details. The use of PTRACE_TRACEME is unchanged.

(namespaced) CAP_SYS_PTRACE is also sufficient here. Both here and in the "admin-only attach" case, it is IMO important to note that creating a user namespace effectively removes the Yama protection because the owner of a namespace, when accessing its contents from outside, is relatively capable. This means that when a process tries to use namespaces to sandbox itself, it inadvertently makes itself more accessible. (This could probably be worked around in the kernel, but such a workaround would likely not be default, but rather opt-in via a new flag for clone() and unshare() or so.)

Thanks for catching this! So I've made that section of text:

    A process that has the CAP_SYS_PTRACE capability can update the /proc/sys/kernel/yama/ptrace_scope file with one of the following values:

    0 ("classic ptrace permissions")
        No additional restrictions on operations that perform PTRACE_MODE_ATTACH checks (beyond those imposed by the commoncap and other LSMs). The use of PTRACE_TRACEME is unchanged.

    1 ("restricted ptrace") [default value]
        When performing an operation that requires a PTRACE_MODE_ATTACH check, the calling process must either have the CAP_SYS_PTRACE capability in the user namespace of the target process or it have a predefined relationship with the target process. By default, the predefined relationship is that the target process must be a child of the caller.

More accurately, must be a descendant of the caller (grandchild is fine, etc).

Thanks. Fixed.

        A target process can employ the prctl(2) PR_SET_PTRACER operation to declare a different PID
Re: Review of ptrace Yama ptrace_scope description
Hi Jann,

...

So I've made that section of text:

    A process that has the CAP_SYS_PTRACE capability can update the /proc/sys/kernel/yama/ptrace_scope file with one of the following values:

    0 ("classic ptrace permissions")
        No additional restrictions on operations that perform PTRACE_MODE_ATTACH checks (beyond those imposed by the commoncap and other LSMs). The use of PTRACE_TRACEME is unchanged.

    1 ("restricted ptrace") [default value]
        When performing an operation that requires a PTRACE_MODE_ATTACH check, the calling process must either have the CAP_SYS_PTRACE capability in the user namespace of the target process or it have a predefined relationship with the target process.

Nit: The grammar in this sentence seems wrong to me. s/or it have/or it must have/?

Yep, thanks for catching that. Fixed now.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
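As a reading aid, the "restricted ptrace" rule as worded in this thread can be modelled in a few lines. This is a simplified sketch of the documentation text only, not of the kernel's Yama implementation: the function names and the parent_of mapping (child PID to parent PID) are hypothetical, and has_cap_sys_ptrace stands in for "the caller has CAP_SYS_PTRACE in the user namespace of the target".

```python
def is_descendant(target_pid, ancestor_pid, parent_of):
    """True if ancestor_pid appears anywhere on target_pid's parent chain."""
    seen = set()
    pid = parent_of.get(target_pid)
    while pid is not None and pid not in seen:
        if pid == ancestor_pid:
            return True
        seen.add(pid)  # guard against a malformed, cyclic mapping
        pid = parent_of.get(pid)
    return False


def may_attach_scope1(caller_pid, target_pid, parent_of,
                      has_cap_sys_ptrace=False, declared_ptracer=None):
    """Model of ptrace_scope == 1 as described in the man-page draft."""
    if has_cap_sys_ptrace:
        return True  # capability in the target's user namespace suffices
    if is_descendant(target_pid, caller_pid, parent_of):
        return True  # default relationship: target descends from caller
    # Otherwise the target must have named the caller via PR_SET_PTRACER.
    return declared_ptracer is not None and declared_ptracer == caller_pid


if __name__ == "__main__":
    tree = {30: 20, 20: 10, 10: 1}  # 30 is a grandchild of 10
    print(may_attach_scope1(10, 30, tree))
    print(may_attach_scope1(30, 10, tree))
```

The model makes the point raised in review easy to see: the capability check comes first, so a caller that is capable in the target's user namespace never reaches the relationship test.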
Re: Review of ptrace Yama ptrace_scope description
Hi Jann,

On 06/25/2016 04:30 PM, Jann Horn wrote:
On Sat, Jun 25, 2016 at 09:30:43AM +0200, Michael Kerrisk (man-pages) wrote:
Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe the Yama ptrace_scope file. I don't think I asked you for review at the time, but in the light of other changes to the ptrace(2) page, it occurred to me that it might be a good idea to ask you to check the text below to see if anything is missing or could be improved. Might you have a moment for that?

/proc/sys/kernel/yama/ptrace_scope
    On systems with the Yama Linux Security Module (LSM) installed (i.e., the kernel was configured with CONFIG_SECURITY_YAMA), the /proc/sys/kernel/yama/ptrace_scope file (available since Linux 3.4) can be used to restrict the ability to trace a process with ptrace(2) (and thus also the ability to use tools such as strace(1) and gdb(1)). The goal of such restrictions is to prevent attack escalation whereby a compromised process can ptrace-attach to other sensitive processes (e.g., a GPG agent or an SSH session) owned by the user in order to gain additional credentials and thus expand the scope of the attack.

    More precisely, the Yama LSM limits two types of operations:

    *  Any operation that performs a ptrace access mode PTRACE_MODE_ATTACH check (for example, ptrace() PTRACE_ATTACH). (See the "Ptrace access mode checking" discussion above.)

    *  ptrace() PTRACE_TRACEME.

    A process that has the CAP_SYS_PTRACE capability can update the /proc/sys/kernel/yama/ptrace_scope file with one of the following values:

    0 ("classic ptrace permissions")
        No additional restrictions on operations that perform PTRACE_MODE_ATTACH checks (beyond those imposed by the commoncap and other LSMs). The use of PTRACE_TRACEME is unchanged.

    1 ("restricted ptrace") [default value]
        When performing an operation that requires a PTRACE_MODE_ATTACH check, the calling process must have a predefined relationship with the target process. By default, the predefined relationship is that the target process must be a child of the caller. A target process can employ the prctl(2) PR_SET_PTRACER operation to declare a different PID that is allowed to perform PTRACE_MODE_ATTACH operations on the target. See the kernel source file Documentation/security/Yama.txt for further details. The use of PTRACE_TRACEME is unchanged.

(namespaced) CAP_SYS_PTRACE is also sufficient here. Both here and in the "admin-only attach" case, it is IMO important to note that creating a user namespace effectively removes the Yama protection because the owner of a namespace, when accessing its contents from outside, is relatively capable. This means that when a process tries to use namespaces to sandbox itself, it inadvertently makes itself more accessible. (This could probably be worked around in the kernel, but such a workaround would likely not be default, but rather opt-in via a new flag for clone() and unshare() or so.)

Thanks for catching this! So I've made that section of text:

    A process that has the CAP_SYS_PTRACE capability can update the /proc/sys/kernel/yama/ptrace_scope file with one of the following values:

    0 ("classic ptrace permissions")
        No additional restrictions on operations that perform PTRACE_MODE_ATTACH checks (beyond those imposed by the commoncap and other LSMs). The use of PTRACE_TRACEME is unchanged.

    1 ("restricted ptrace") [default value]
        When performing an operation that requires a PTRACE_MODE_ATTACH check, the calling process must either have the CAP_SYS_PTRACE capability in the user namespace of the target process or it have a predefined relationship with the target process. By default, the predefined relationship is that the target process must be a child of the caller. A target process can employ the prctl(2) PR_SET_PTRACER operation to declare a different PID that is allowed to perform PTRACE_MODE_ATTACH operations on the target. See the kernel source file Documentation/security/Yama.txt for further details. The use of PTRACE_TRACEME is unchanged.

    2 ("admin-only attach")
        Only processes with the CAP_SYS_PTRACE capability
Re: [PATCH v2 2/2] namespaces: add transparent user namespaces
Hi Jann,

Patches such as this really should CC linux-api@ (added).

On Sat, Jun 25, 2016 at 2:23 AM, Jann Horn wrote:
> This allows the admin of a user namespace to mark the namespace as
> transparent. All other namespaces, by default, are opaque.
>
> While the current behavior of user namespaces is appropriate for use in
> containers, there are many programs that only use user namespaces because
> doing so enables them to do other things (e.g. unsharing the mount or
> network namespace) that require namespaced capabilities. For them, the
> inability to see the real UIDs and GIDs of things from inside the user
> namespace can be very annoying.
>
> In a transparent namespace, all UIDs and GIDs that are mapped into its
> first opaque ancestor are visible and are not remapped. This means that if
> a process e.g. stat()s the real root directory in a namespace, it will
> still see it as owned by UID 0.
>
> Traditionally, any UID or GID that was visible in a user namespace was also
> mapped into the namespace, giving the namespace admin full access to it.
> This patch introduces a distinction: In a transparent namespace, UIDs and
> GIDs can be visible without being mapped. Non-mapped, visible UIDs can be
> passed from the kernel to userspace, but userspace can't send them back to
> the kernel.

Can you explain "can't send them back to the kernel" in more detail? (Some examples of what is and isn't possible would be helpful.)

> In order to be able to fully use specific UIDs/GIDs and gain
> privileges over them, mappings need to be set up in the usual way -
> however, to avoid aliasing problems, only identity mappings are permitted.
>
> v2:
> Ensure that all relevant from_k[ug]id callers show up in the patch.
> _transparent would be more verbose than _tp, but considering the line
> length rule, that's just too long.
>
> Yes, this makes the patch rather large.
>
> Behavior should be the same as in v1, except that I'm not touching orangefs
> in this patch because every single use of from_k[ug]id in it is wrong in
> some way. (Thanks for making me reread all that stuff, Eric.) I'll write a
> separate patch or at least report the issue with more detail later.
>
> (Also, the handling of user namespaces when dealing with signals is
> super-ugly and kind of incorrect. That should probably be cleaned up.)

I'm curious about this detail: can you say some more about the issues here?

> posix_acl_to_xattr would have changed behavior in the v1 patch, but isn't
> changed here. Because it's only used with init_user_ns, that won't change
> user-visible behavior relative to v1.
>
> This patch was compile-tested with allyesconfig. I also ran a VM with this
> patch applied and checked that it still works, but that probably doesn't
> mean much.

One of the things notably lacking from this commit message is any sort of description of the user-space-API changes that it makes. I presume it's a matter of some /proc files. Could you explain the changes (and add that detail in any further commit message)?

Thanks,

Michael

> Signed-off-by: Jann Horn
> ---
>  arch/alpha/kernel/osf_sys.c       |   4 +-
>  arch/arm/kernel/sys_oabi-compat.c |   4 +-
>  arch/ia64/kernel/signal.c         |   4 +-
>  arch/s390/kernel/compat_linux.c   |  26 +++---
>  arch/sparc/kernel/sys_sparc32.c   |   4 +-
>  arch/x86/ia32/sys_ia32.c          |   4 +-
>  drivers/android/binder.c          |   2 +-
>  drivers/gpu/drm/drm_info.c        |   2 +-
>  drivers/gpu/drm/drm_ioctl.c       |   2 +-
>  drivers/net/tun.c                 |   4 +-
>  fs/autofs4/dev-ioctl.c            |   4 +-
>  fs/autofs4/waitq.c                |   4 +-
>  fs/binfmt_elf.c                   |  12 +--
>  fs/binfmt_elf_fdpic.c             |  12 +--
>  fs/compat.c                       |   4 +-
>  fs/fcntl.c                        |   4 +-
>  fs/ncpfs/ioctl.c                  |  12 +--
>  fs/posix_acl.c                    |  11 ++-
>  fs/proc/array.c                   |  18 ++--
>  fs/proc/base.c                    |  30 +--
>  fs/quota/kqid.c                   |  12 ++-
>  fs/stat.c                         |  12 +--
>  include/linux/uidgid.h            |  24 +++--
>  include/linux/user_namespace.h    |   4 +
>  include/net/scm.h                 |   4 +-
>  ipc/mqueue.c                      |   2 +-
>  ipc/msg.c                         |   8 +-
>  ipc/sem.c                         |   8 +-
>  ipc/shm.c                         |   8 +-
>  ipc/util.c                        |   8 +-
>  kernel/acct.c                     |   4 +-
>  kernel/exit.c                     |   6 +-
>  kernel/groups.c                   |   2 +-
>  kernel/signal.c                   |  16 ++--
>  kernel/sys.c                      |  24 ++---
>  kernel/trace/trace.c              |   2 +-
>  kernel/tsacct.c                   |   4 +-
>  kernel/uid16.c                    |  22 ++---
>  kernel/user.c                     |   1 +
>  kernel/user_namespace.c           | 178 +++---
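For readers less familiar with the existing (opaque) behaviour that the quoted patch description contrasts against, the following sketch shows how an ID translates through a uid_map-style table today. It is an illustration under stated assumptions, not code from the patch: the function names are hypothetical, the three-column "inside outside length" layout is that of /proc/PID/uid_map, and 65534 is assumed as the overflow ID (the usual default of /proc/sys/kernel/overflowuid). The proposed transparent mode is deliberately not modelled.

```python
OVERFLOW_ID = 65534  # assumed default of /proc/sys/kernel/overflowuid


def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def to_namespace_id(outside_id, id_map):
    """Translate an outside ID to its inside value, or None if unmapped."""
    for inside, outside, length in id_map:
        if outside <= outside_id < outside + length:
            return inside + (outside_id - outside)
    return None


def displayed_id(outside_id, id_map):
    """What an opaque namespace reports to user space, e.g. via stat()."""
    mapped = to_namespace_id(outside_id, id_map)
    return OVERFLOW_ID if mapped is None else mapped


if __name__ == "__main__":
    # A typical unprivileged sandbox: outside UID 1000 appears as root inside.
    id_map = parse_id_map("0 1000 1\n")
    print(displayed_id(1000, id_map), displayed_id(0, id_map))
```

With the single-entry map above, files owned by the real UID 0 show up as the overflow ID inside the namespace, which is the "inability to see the real UIDs and GIDs" that the patch description calls annoying.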
Review of ptrace Yama ptrace_scope description
Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe the Yama ptrace_scope file. I don't think I asked you for review at the time, but in the light of other changes to the ptrace(2) page, it occurred to me that it might be a good idea to ask you to check the text below to see if anything is missing or could be improved. Might you have a moment for that?

/proc/sys/kernel/yama/ptrace_scope
    On systems with the Yama Linux Security Module (LSM) installed (i.e., the kernel was configured with CONFIG_SECURITY_YAMA), the /proc/sys/kernel/yama/ptrace_scope file (available since Linux 3.4) can be used to restrict the ability to trace a process with ptrace(2) (and thus also the ability to use tools such as strace(1) and gdb(1)). The goal of such restrictions is to prevent attack escalation whereby a compromised process can ptrace-attach to other sensitive processes (e.g., a GPG agent or an SSH session) owned by the user in order to gain additional credentials and thus expand the scope of the attack.

    More precisely, the Yama LSM limits two types of operations:

    *  Any operation that performs a ptrace access mode PTRACE_MODE_ATTACH check (for example, ptrace() PTRACE_ATTACH). (See the "Ptrace access mode checking" discussion above.)

    *  ptrace() PTRACE_TRACEME.

    A process that has the CAP_SYS_PTRACE capability can update the /proc/sys/kernel/yama/ptrace_scope file with one of the following values:

    0 ("classic ptrace permissions")
        No additional restrictions on operations that perform PTRACE_MODE_ATTACH checks (beyond those imposed by the commoncap and other LSMs). The use of PTRACE_TRACEME is unchanged.

    1 ("restricted ptrace") [default value]
        When performing an operation that requires a PTRACE_MODE_ATTACH check, the calling process must have a predefined relationship with the target process. By default, the predefined relationship is that the target process must be a child of the caller. A target process can employ the prctl(2) PR_SET_PTRACER operation to declare a different PID that is allowed to perform PTRACE_MODE_ATTACH operations on the target. See the kernel source file Documentation/security/Yama.txt for further details. The use of PTRACE_TRACEME is unchanged.

    2 ("admin-only attach")
        Only processes with the CAP_SYS_PTRACE capability may perform PTRACE_MODE_ATTACH operations or trace children that employ PTRACE_TRACEME.

    3 ("no attach")
        No process may perform PTRACE_MODE_ATTACH operations or trace children that employ PTRACE_TRACEME. Once this value has been written to the file, it cannot be changed.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
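The four values described above lend themselves to a small lookup. The following is a minimal sketch of how a diagnostic script might report the current setting; the names describe_scope and current_scope are hypothetical, and the labels are taken verbatim from the draft text. It only reads the file, so it needs no special capability.

```python
from pathlib import Path

PTRACE_SCOPE_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")

# Labels as given in the ptrace(2) draft text in this message.
SCOPE_LABELS = {
    0: "classic ptrace permissions",
    1: "restricted ptrace",
    2: "admin-only attach",
    3: "no attach",
}


def describe_scope(raw):
    """Turn the file's raw contents into a (value, label) pair."""
    value = int(raw.strip())
    return value, SCOPE_LABELS.get(value, "unknown value")


def current_scope():
    """Return (value, label), or None when the Yama file is absent."""
    try:
        return describe_scope(PTRACE_SCOPE_PATH.read_text())
    except FileNotFoundError:
        return None  # kernel built without CONFIG_SECURITY_YAMA


if __name__ == "__main__":
    print(current_scope())
```

Returning None rather than raising keeps the "Yama not installed" case distinct from value 0, since the two mean different things even though neither adds restrictions.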
Re: Documenting ptrace access mode checking
On 06/24/2016 05:18 PM, Casey Schaufler wrote:
On 6/24/2016 1:40 AM, Michael Kerrisk (man-pages) wrote:
On 06/22/2016 11:11 PM, Kees Cook wrote:
On Wed, Jun 22, 2016 at 12:21 PM, Michael Kerrisk (man-pages) <mtk.manpa...@gmail.com> wrote:
On 06/21/2016 10:55 PM, Jann Horn wrote:
On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages) wrote:

5. The kernel LSM security_ptrace_access_check() interface is invoked to see if ptrace access is permitted. The results depend on the LSM. The implementation of this interface in the default LSM performs the following steps:

For people who are unaware of how the LSM API works, it might be good to clarify that the commoncap LSM is *always* invoked; otherwise, it might give the impression that using another LSM would replace it.

As we can see, I am one of those who are unaware of how the LSM API works :-/.

(Also, are there other documents that refer to it as "default LSM"? I think that that term is slightly confusing.)

No, that's a terminological confusion of my own making. Fixed now. I changed this text to:

Various parts of the kernel-user-space API (not just ptrace(2) operations), require so-called "ptrace access mode permissions" which are gated by any enabled Linux Security Module (LSM), for example SELinux, Yama, or Smack, and by the commoncap LSM (which is always invoked). Prior to Linux 2.6.27, all such checks were of a single type. Since Linux 2.6.27, two access mode levels are distinguished:

BTW, can you point me at the piece(s) of kernel code that show that "commoncap" is always invoked in addition to any other LSM that has been installed?

It's not entirely obvious, but the bottom of security/commoncap.c shows:

    #ifdef CONFIG_SECURITY
    struct security_hook_list capability_hooks[] = {
            LSM_HOOK_INIT(capable, cap_capable),
            ...
    };

    void __init capability_add_hooks(void)
    {
            security_add_hooks(capability_hooks, ARRAY_SIZE(capability_hooks));
    }
    #endif

And security/security.c shows the initialization order of the LSMs:

    int __init security_init(void)
    {
            pr_info("Security Framework initialized\n");

            /*
             * Load minor LSMs, with the capability module always first.
             */
            capability_add_hooks();
            yama_add_hooks();
            loadpin_add_hooks();

            /*
             * Load all the remaining security modules.
             */
            do_security_initcalls();

            return 0;
    }

So, I just want to check my understanding of a couple of points:

1. The commoncap LSM is invoked first, and if it denies access, then no further LSM is/needs to be called.

Yes. The LSM infrastructure is "bail on fail".

2. Is it the case that only one of the other LSMs (SELinux, Yama, AppArmor, etc.) is invoked, or can more than one be invoked? I thought only one is invoked, but perhaps I am out of date in my understanding.

All registered modules are invoked, but only one "major" module can be registered. The "minor" modules show up in security_init, while the majors come in via do_security_initcalls. I am in the process of messing that all up with patches allowing multiple major modules. Stay tuned.

Thanks for the info, Casey.

Cheers,

Michael
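The "bail on fail" behaviour described in this exchange can be pictured with a small user-space model in C. This is not kernel code and does not use the kernel's security_hook_list machinery. The type access_hook, the struct hook_result and the function call_hooks() are all hypothetical names made up for the sketch. It only shows the control flow: hooks run in registration order and the first denial ends the walk.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one module's hook: 0 allows, negative errno denies. */
typedef int (*access_hook)(void);

struct hook_result {
    int rc;          /* 0 if every hook allowed, otherwise the first denial */
    size_t invoked;  /* how many hooks were actually called */
};

/* Call each hook in registration order and stop at the first denial. */
struct hook_result call_hooks(const access_hook *hooks, size_t count)
{
    struct hook_result res = { 0, 0 };

    for (size_t i = 0; i < count; i++) {
        res.invoked++;
        res.rc = hooks[i]();
        if (res.rc != 0)
            break;  /* "bail on fail": later hooks are not consulted */
    }
    return res;
}

int hook_allow(void) { return 0; }
int hook_deny(void)  { return -EPERM; }

/* Demo chains; slot 0 plays the role of the module registered first. */
const access_hook all_allow[]    = { hook_allow, hook_allow, hook_allow };
const access_hook first_denies[] = { hook_deny, hook_allow, hook_allow };
```

With first_denies, call_hooks() reports a denial after invoking a single hook, which matches point 1 above. With all_allow, every registered hook is consulted, which matches Casey's statement that all registered modules are invoked.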
Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
On 06/24/2016 11:52 AM, Thomas Gleixner wrote:
On Fri, 24 Jun 2016, Michael Kerrisk (man-pages) wrote:

By the way, I just realized something that wasn't initially obvious to me, and documented it in the futex(2) man page:

Note: for FUTEX_WAIT, timeout is interpreted as a relative value. This differs from other futex operations, where timeout is interpreted as an absolute value. To obtain the equivalent of FUTEX_WAIT with an absolute timeout, employ FUTEX_WAIT_BITSET with val3 specified as FUTEX_BITSET_MATCH_ANY.

Okay?

Yes.

Thanks, Thomas.

Cheers,

Michael
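The note above can be sketched in C as follows. This is a minimal illustration, assuming a Linux system where futex(2) is reached through syscall(2). The wrapper futex_wait_until() and the helper deadline_after_ms() are names invented for this example. Because FUTEX_CLOCK_REALTIME is not set, the absolute deadline is measured on CLOCK_MONOTONIC.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Equivalent of FUTEX_WAIT with an absolute timeout: block while *uaddr
 * still equals 'expected', until 'deadline' (absolute, CLOCK_MONOTONIC).
 * Returns 0 when woken, or -1 with errno set (EAGAIN if the value did
 * not match, ETIMEDOUT if the deadline passed).
 */
int futex_wait_until(uint32_t *uaddr, uint32_t expected,
                     const struct timespec *deadline)
{
    return (int)syscall(SYS_futex, uaddr,
                        FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG,
                        expected, deadline, NULL,
                        FUTEX_BITSET_MATCH_ANY);
}

/* Hypothetical helper: absolute deadline 'ms' milliseconds from now. */
struct timespec deadline_after_ms(long ms)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    ts.tv_sec += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}
```

A caller that is interrupted by a signal can retry with the same deadline. With the relative timeout of plain FUTEX_WAIT it would have to recompute the remaining time first.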
Re: Documenting ptrace access mode checking
Hi Eric,

On 06/23/2016 09:04 PM, Eric W. Biederman wrote:
"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:

Hi Eric,

On 06/21/2016 09:55 PM, Eric W. Biederman wrote:

Hmm. When I gave this level of detail about the user namespace permission checks you gave me some flak, because it was not particularly comprehensible to the end users. I think you deserve the same feedback.

How do we say this in a way that does not describes a useful way to think about it. I read this and I know a lot of what is going on and my mind goes numb.

How about something like this:

If the caller's uid and gid are the same as a process's uids and gids and the process is configured to allow core dumps (aka it was never setuid or setgid) then the caller is allowed to ptrace a process.

Otherwise the caller must have CAP_SYS_PTRACE.

Linux security modules impose additional restrictions.

For consistency, access to various process attributes is guarded with the same security checks as the ptrace system call itself, as they are all methods to get information about a process.

We certainly need something that gives a high level view so people reading the man page can know what to expect. If you get down into the weeds we run the danger of people beginning to think they can depend upon bugs in the implementation.

Thanks for the feedback, but I think more detail is required than you suggest. (And I added all of that detail somewhat reluctantly.) See my other replies for my rationale.

What I saw badly missing from your description is not the level of detail but bringing things into a form that ordinary mortals can understand. For an explanation to be clear I think we very much need the high level overview first. Then we can expand that description with the very detailed view.

I very much think we need to describe things in such a way that people understand the principles behind the permission checks, and not just have the documentation echo the code, so that people can know what weird things LSMs like yama are likely to do, and how these checks are likely to evolve in the future.

So, I completely agree with you, and I agree that this could be better. At first, I understood your meaning to be that I should avoid all of the detail, and just limit the man page to some very high level text as you proposed. So, I think it's worth prefixing the details with some attempt at a high-level picture. How about this as an introductory paragraph:

Various parts of the kernel-user-space API (not just ptrace(2) operations), require so-called "ptrace access mode" checks, whose outcome determines whether an operation is permitted (or, in a few cases, causes a "read" operation to return sanitized data). These checks are performed in cases where one process can inspect sensitive information about, or in some cases modify the state of, another process. The checks are based on factors such as the credentials and capabilities of the two processes, whether or not the "target" process is dumpable, and the results of checks performed by any enabled Linux Security Module (LSM), for example SELinux, Yama, or Smack, and by the commoncap LSM (which is always invoked).

?

Because one thing is clear to me. The evolution of these details is clearly not done, and will continue to change in the future.

Maybe people will even write man page patches when that happens :-).

Cheers,

Michael
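Eric's proposed high-level rule can be written down as a few lines of C. This is strictly a model of that summary paragraph, not of the kernel's actual check, which considers more credentials, user namespaces, and the LSM results discussed elsewhere in the thread. The struct proc_view and the function may_access_simplified() are hypothetical names introduced only for this sketch.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, heavily reduced view of a process for this model. */
struct proc_view {
    uid_t uid;
    gid_t gid;
    bool dumpable;            /* configured to allow core dumps */
    bool has_cap_sys_ptrace;  /* only meaningful for the caller */
};

/*
 * Simplified rule from the summary above: matching uid and gid plus a
 * dumpable target is enough; otherwise the caller needs CAP_SYS_PTRACE.
 * Additional restrictions imposed by LSMs are deliberately left out.
 */
bool may_access_simplified(const struct proc_view *caller,
                           const struct proc_view *target)
{
    if (caller->uid == target->uid &&
        caller->gid == target->gid &&
        target->dumpable)
        return true;

    return caller->has_cap_sys_ptrace;
}
```

The model is useful mainly as a reading aid: each factor listed in the proposed introductory paragraph (credentials, capabilities, dumpability) appears as one field, and the LSM step is the part that is missing.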
Re: Documenting ptrace access mode checking
On 06/22/2016 11:11 PM, Kees Cook wrote:
On Wed, Jun 22, 2016 at 12:21 PM, Michael Kerrisk (man-pages) <mtk.manpa...@gmail.com> wrote:
On 06/21/2016 10:55 PM, Jann Horn wrote:
On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages) wrote:

5. The kernel LSM security_ptrace_access_check() interface is invoked to see if ptrace access is permitted. The results depend on the LSM. The implementation of this interface in the default LSM performs the following steps:

For people who are unaware of how the LSM API works, it might be good to clarify that the commoncap LSM is *always* invoked; otherwise, it might give the impression that using another LSM would replace it.

As we can see, I am one of those who are unaware of how the LSM API works :-/.

(Also, are there other documents that refer to it as "default LSM"? I think that that term is slightly confusing.)

No, that's a terminological confusion of my own making. Fixed now. I changed this text to:

Various parts of the kernel-user-space API (not just ptrace(2) operations), require so-called "ptrace access mode permissions" which are gated by any enabled Linux Security Module (LSM), for example SELinux, Yama, or Smack, and by the commoncap LSM (which is always invoked). Prior to Linux 2.6.27, all such checks were of a single type. Since Linux 2.6.27, two access mode levels are distinguished:

BTW, can you point me at the piece(s) of kernel code that show that "commoncap" is always invoked in addition to any other LSM that has been installed?

It's not entirely obvious, but the bottom of security/commoncap.c shows:

    #ifdef CONFIG_SECURITY
    struct security_hook_list capability_hooks[] = {
            LSM_HOOK_INIT(capable, cap_capable),
            ...
    };

    void __init capability_add_hooks(void)
    {
            security_add_hooks(capability_hooks, ARRAY_SIZE(capability_hooks));
    }
    #endif

And security/security.c shows the initialization order of the LSMs:

    int __init security_init(void)
    {
            pr_info("Security Framework initialized\n");

            /*
             * Load minor LSMs, with the capability module always first.
             */
            capability_add_hooks();
            yama_add_hooks();
            loadpin_add_hooks();

            /*
             * Load all the remaining security modules.
             */
            do_security_initcalls();

            return 0;
    }

So, I just want to check my understanding of a couple of points:

1. The commoncap LSM is invoked first, and if it denies access, then no further LSM is/needs to be called.

2. Is it the case that only one of the other LSMs (SELinux, Yama, AppArmor, etc.) is invoked, or can more than one be invoked? I thought only one is invoked, but perhaps I am out of date in my understanding.

Cheers,

Michael
Re: Documenting ptrace access mode checking
Stephen,

On 06/23/2016 08:05 PM, Stephen Smalley wrote:
On 06/21/2016 05:41 AM, Michael Kerrisk (man-pages) wrote:

Hi Jann, Stephen, et al.

Jann, since you recently committed a patch in this area, and Stephen, since you committed 006ebb40d3d much further back in time, I wonder if you might help me by reviewing the text below that I propose to add to the ptrace(2) man page, in order to document "ptrace access mode checking" that is performed in various parts of the kernel-user-space interface. Of course, I welcome input from anyone else as well.

Here's the new ptrace(2) text. Any comments, technical or terminological fixes, other improvements, etc. are welcome.

[[

Ptrace access mode checking

Various parts of the kernel-user-space API (not just ptrace(2) operations), require so-called "ptrace access mode permissions" which are gated by Linux Security Modules (LSMs) such as SELinux, Yama, Smack, or the default LSM. Prior to Linux 2.6.27, all such checks were of a single type. Since Linux 2.6.27, two access mode levels are distinguished:

PTRACE_MODE_READ
    For "read" operations or other operations that are less dangerous, such as: get_robust_list(2); kcmp(2); reading /proc/[pid]/auxv, /proc/[pid]/environ, or /proc/[pid]/stat; or readlink(2) of a /proc/[pid]/ns/* file.

PTRACE_MODE_ATTACH
    For "write" operations, or other operations that are more dangerous, such as: ptrace attaching (PTRACE_ATTACH) to another process or calling process_vm_writev(2). (PTRACE_MODE_ATTACH was effectively the default before Linux 2.6.27.)

That was the intent when the distinction was introduced, but it doesn't appear to have been properly maintained, e.g. there is now a common helper lock_trace() that is used for /proc/pid/{stack,syscall,personality} but checks PTRACE_MODE_ATTACH, and PTRACE_MODE_ATTACH is also used in timerslack_ns_write/show(). Likely should review and make them consistent. There was also some debate about proper handling of /proc/pid/fd. Arguably that one might belong back in the _ATTACH camp.

Thanks for the background info.

Since Linux 4.5, the above access mode checks may be combined (ORed) with one of the following modifiers:

PTRACE_MODE_FSCREDS
    Use the caller's filesystem UID and GID (see credentials(7)) or effective capabilities for LSM checks.

PTRACE_MODE_REALCREDS
    Use the caller's real UID and GID or permitted capabilities for LSM checks. This was effectively the default before Linux 4.5.

Because combining one of the credential modifiers with one of the aforementioned access modes is typical, some macros are defined in the kernel sources for the combinations:

PTRACE_MODE_READ_FSCREDS
    Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

PTRACE_MODE_READ_REALCREDS
    Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

PTRACE_MODE_ATTACH_FSCREDS
    Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

PTRACE_MODE_ATTACH_REALCREDS
    Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

One further modifier can be ORed with the access mode:

PTRACE_MODE_NOAUDIT (since Linux 3.3)
    Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]

Some ptrace access mode checks, such as checks when reading /proc/pid/stat, merely cause the output to be filtered/sanitized rather than an error to be returned to the caller. In these cases, accessing the file is not a security violation and there is no reason to generate a security audit record. This modifier suppresses the generation of such an audit record for the particular access check.

Thanks, I've added that text to the man page more or less as you gave it here.

Cheers,

Michael
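The way the access modes and modifiers above are ORed together can be illustrated with a short C sketch. These flags live in kernel-internal headers and are never visible to user space, so the numeric values below are an assumption chosen for illustration and should not be relied upon. The combined macros follow the "Defined as" text quoted above. The function mode_has_one_cred_modifier() is a hypothetical helper that encodes the "combined with one of the following modifiers" wording.

```c
#include <stdbool.h>

/* Illustrative stand-ins for kernel-internal flags; values are assumed. */
#define PTRACE_MODE_READ       0x01
#define PTRACE_MODE_ATTACH     0x02
#define PTRACE_MODE_NOAUDIT    0x04
#define PTRACE_MODE_FSCREDS    0x08
#define PTRACE_MODE_REALCREDS  0x10

/* Combinations, as described in the proposed man page text. */
#define PTRACE_MODE_READ_FSCREDS \
    (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_READ_REALCREDS \
    (PTRACE_MODE_READ | PTRACE_MODE_REALCREDS)
#define PTRACE_MODE_ATTACH_FSCREDS \
    (PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_ATTACH_REALCREDS \
    (PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS)

/*
 * Hypothetical helper: true if exactly one of the two credential
 * modifiers is present in 'mode'. Other bits (access level, NOAUDIT)
 * are ignored.
 */
bool mode_has_one_cred_modifier(unsigned int mode)
{
    bool fscreds = (mode & PTRACE_MODE_FSCREDS) != 0;
    bool realcreds = (mode & PTRACE_MODE_REALCREDS) != 0;

    return fscreds != realcreds;
}
```

For example, a check that only filters output, like the /proc/pid/stat case Stephen describes, would be expressed in this model as PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT.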
Re: Documenting ptrace access mode checking
On 06/23/2016 08:56 PM, Eric W. Biederman wrote:
"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:

Hi Oleg,

On 06/22/2016 11:51 PM, Oleg Nesterov wrote:
On 06/21, Eric W. Biederman wrote:

Adding Oleg just because he seems to do most of the ptrace related maintenance these days.

so I have to admit that I never even tried to actually understand ptrace_may_access ;)

We certainly need something that gives a high level view so people reading the man page can know what to expect. If you get down into the weeds we run the danger of people beginning to think they can depend upon bugs in the implementation.

Personally I agree. I think "man ptrace" shouldn't tell too much about kernel internals.

See my other replies on this topic. Somehow, we need a way of describing the behavior that user-space sees. I think it's inevitable that that means talking about what's going on "under the hood".

Regarding Eric's point that "we run the danger of people beginning to think they can depend upon bugs in the implementation": when it comes to breaking the ABI, the presence or absence of documentation doesn't save us on that point (Linus has a few times made his position wrt documentation clear).

Which are interesting in this respect as a bug in the implementation that is a security issue can and will be changed, even if userspace breaks. Breaking userspace is not desirable but when there is no other reasonable choice it will happen.

Yes, good point.

Cheers,

Michael
Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
On 06/23/2016 09:53 PM, Darren Hart wrote:
On Thu, Jun 23, 2016 at 08:35:15PM +0200, Michael Kerrisk (man-pages) wrote:
Hi Darren,
On 06/23/2016 06:16 PM, Darren Hart wrote:
On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:
On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:
On 06/23/2016 09:18 AM, Thomas Gleixner wrote:

Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner <t...@linutronix.de> wrote:
On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:

And that universe would love to have your documentation of FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),

I give you almost the full treatment, but I leave REQUEUE_PI to Darren and FUTEX_WAKE_OP to Jakub. :)

[...]

FUTEX_CLOCK_REALTIME
    This option bit can be ored on the futex ops FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI. If set the kernel treats the user space supplied timeout as absolute time based on CLOCK_REALTIME. If not set the kernel treats the user space supplied timeout as relative time.

Unfortunately, I should have checked the code more carefully...

Me too :)

Seems to be going around...

Looking more carefully at the code, I understand the situation is the following:

FUTEX_LOCK_PI
    Always uses CLOCK_REALTIME
    'timeout' is absolute

Yes.

FUTEX_WAIT_REQUEUE_PI
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is absolute

Yes

FUTEX_WAIT_BITSET
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is absolute

Yes

FUTEX_WAIT
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is relative

Yes.

I've amended the man page to describe those details.

OK, that confirms my question, timeout interpretation as relative or absolute is based on the op code, not the CLOCK flag.

The flag was explicitly added to allow FUTEX_WAIT to hand in absolute time.

When you say that the "flag was added", which flag do you mean? Or, did you mean: "applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute time".

I didn't express myself clearly. When Darren added the support for CLOCK_REALTIME to FUTEX_WAIT I think he wanted to add absolute timeout support. Anything else does not make sense.

I sent that patch because reading the new man page it struck me as strange that FUTEX_WAIT was restricted to CLOCK_MONOTONIC and the other op codes were not, especially since FUTEX_WAIT is just FUTEX_WAIT_BITSET with the mask set to ALL. I didn't realize the impact to relative/absolute interpretation of the timeout value at the time.

I think it was a mistake to introduce a change that made FUTEX_WAIT interpret the timeout differently based on the CLOCK flag,

I'm missing something. Where does it do that? As far as I can tell FUTEX_WAIT always interprets the timeout as relative, regardless of presence/absence of FUTEX_CLOCK_REALTIME? Am I missing something?

No you're not. The code as it stands today is always relative, but it gets the base time from the wrong clock source in the case of FUTEX_CLOCK_REALTIME.

Ahh yes, I'd clicked to that, but forgot to say so.

I was stating that I think it would be a mistake to add absolute timeout to FUTEX_WAIT based on the FUTEX_CLOCK_REALTIME flag, which is how Thomas describes above his interpretation of my earlier change.

Got it now. Thanks for the clarification, Darren.

Cheers
Michael
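The per-operation summary confirmed in the message above can be written down as a small lookup. This is only a sketch of the table as the thread states it, not kernel code: struct futex_timeout_semantics and futex_describe_timeout() are hypothetical names, and the constants come from the user-space header <linux/futex.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>
#include <linux/futex.h>

/* The flag bits sit outside the command mask, so masking never hides an op. */
static_assert((FUTEX_CMD_MASK & FUTEX_CLOCK_REALTIME) == 0,
              "FUTEX_CLOCK_REALTIME is not part of the command");

/* Hypothetical description of how one futex op treats its timeout. */
struct futex_timeout_semantics {
    bool known;      /* false for ops the thread does not discuss */
    bool absolute;   /* true: absolute deadline, false: relative interval */
    clockid_t clock; /* clock the timeout is measured against */
};

struct futex_timeout_semantics futex_describe_timeout(int op)
{
    struct futex_timeout_semantics s = { .known = false, .absolute = false,
                                         .clock = CLOCK_MONOTONIC };
    int cmd = op & FUTEX_CMD_MASK; /* strip the flag bits */
    clockid_t flagged = (op & FUTEX_CLOCK_REALTIME) ? CLOCK_REALTIME
                                                    : CLOCK_MONOTONIC;

    switch (cmd) {
    case FUTEX_LOCK_PI:            /* always CLOCK_REALTIME, absolute */
        s = (struct futex_timeout_semantics){ true, true, CLOCK_REALTIME };
        break;
    case FUTEX_WAIT_BITSET:        /* clock from the flag, absolute */
    case FUTEX_WAIT_REQUEUE_PI:
        s = (struct futex_timeout_semantics){ true, true, flagged };
        break;
    case FUTEX_WAIT:               /* clock from the flag, relative */
        s = (struct futex_timeout_semantics){ true, false, flagged };
        break;
    default:
        break;
    }
    return s;
}
```

Note that relative versus absolute depends only on the command, while the clock depends on the flag (except for FUTEX_LOCK_PI), which is the point Darren confirms.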
Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
On 06/23/2016 08:28 PM, Darren Hart wrote:
On Thu, Jun 23, 2016 at 07:26:52PM +0200, Thomas Gleixner wrote:
On Thu, 23 Jun 2016, Darren Hart wrote:
On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:

In my opinion, we should treat the timeout value as relative for FUTEX_WAIT regardless of the CLOCK used.

Which requires even more changes as you have to select which clock you are using for adding the base time.

Right, something like the following?

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..c39d807 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,8 +3230,12 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
-			t = ktime_add_safe(ktime_get(), t);
+		if (cmd == FUTEX_WAIT) {
+			if (cmd & FUTEX_CLOCK_REALTIME)
+				t = ktime_add_safe(ktime_get_real(), t);
+			else
+				t = ktime_add_safe(ktime_get(), t);
+		}
 		tp = &t;
 	}
 	/*

Just in the interests of readability/maintainability, might it not make some sense to recode the timeout handling for FUTEX_WAIT within futex_wait()? I think that part of the reason we're in this mess of inconsistency is that timeout interpretation is being handled at too many different points in the code.

And as a follow-on, what is the reason for FUTEX_LOCK_PI only using CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a specific amount of time, regardless of wall time.

Yes, that's another weird inconsistency.

Thanks, Michael
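The idea discussed above (a relative timeout becomes an absolute deadline by adding the current time of whichever clock was selected) has a straightforward user-space analogue. The following is a minimal sketch under that assumption, not the kernel's implementation; ts_add() and deadline_from_relative() are hypothetical helper names, while clock_gettime() is the standard POSIX call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Add a relative interval to a base time, keeping tv_nsec normalized. */
struct timespec ts_add(struct timespec base, struct timespec rel)
{
    struct timespec out;

    /* Both inputs are expected to be normalized already. */
    assert(base.tv_nsec >= 0 && base.tv_nsec < NSEC_PER_SEC);
    assert(rel.tv_nsec >= 0 && rel.tv_nsec < NSEC_PER_SEC);

    out.tv_sec = base.tv_sec + rel.tv_sec;
    out.tv_nsec = base.tv_nsec + rel.tv_nsec;
    if (out.tv_nsec >= NSEC_PER_SEC) { /* carry into seconds */
        out.tv_sec += 1;
        out.tv_nsec -= NSEC_PER_SEC;
    }
    return out;
}

/*
 * Turn a relative timeout into an absolute deadline on the given clock.
 * Returns 0 on success, -1 if clock_gettime() fails.
 */
int deadline_from_relative(clockid_t clock, struct timespec rel,
                           struct timespec *deadline)
{
    struct timespec now;

    if (clock_gettime(clock, &now) != 0)
        return -1;
    *deadline = ts_add(now, rel);
    return 0;
}
```

The base must be read from the same clock the deadline will later be compared against: a CLOCK_MONOTONIC base added to the interval yields a value that is meaningless on CLOCK_REALTIME, which is the "wrong clock source" problem raised elsewhere in this thread.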
Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
Hi Darren,

On 06/23/2016 06:16 PM, Darren Hart wrote:
On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:
On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:
On 06/23/2016 09:18 AM, Thomas Gleixner wrote:

Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner <t...@linutronix.de> wrote:
On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:

And that universe would love to have your documentation of FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),

I give you almost the full treatment, but I leave REQUEUE_PI to Darren and FUTEX_WAKE_OP to Jakub. :)

[...]

FUTEX_CLOCK_REALTIME
    This option bit can be ored on the futex ops FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI. If set the kernel treats the user space supplied timeout as absolute time based on CLOCK_REALTIME. If not set the kernel treats the user space supplied timeout as relative time.

Unfortunately, I should have checked the code more carefully...

Me too :)

Seems to be going around...

Looking more carefully at the code, I understand the situation is the following:

FUTEX_LOCK_PI
    Always uses CLOCK_REALTIME
    'timeout' is absolute

Yes.

FUTEX_WAIT_REQUEUE_PI
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is absolute

Yes

FUTEX_WAIT_BITSET
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is absolute

Yes

FUTEX_WAIT
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is relative

Yes.

I've amended the man page to describe those details.

OK, that confirms my question, timeout interpretation as relative or absolute is based on the op code, not the CLOCK flag.

The flag was explicitly added to allow FUTEX_WAIT to hand in absolute time.

When you say that the "flag was added", which flag do you mean? Or, did you mean: "applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute time".

I didn't express myself clearly. When Darren added the support for CLOCK_REALTIME to FUTEX_WAIT I think he wanted to add absolute timeout support. Anything else does not make sense.

I sent that patch because reading the new man page it struck me as strange that FUTEX_WAIT was restricted to CLOCK_MONOTONIC and the other op codes were not, especially since FUTEX_WAIT is just FUTEX_WAIT_BITSET with the mask set to ALL. I didn't realize the impact to relative/absolute interpretation of the timeout value at the time.

I think it was a mistake to introduce a change that made FUTEX_WAIT interpret the timeout differently based on the CLOCK flag,

I'm missing something. Where does it do that? As far as I can tell FUTEX_WAIT always interprets the timeout as relative, regardless of presence/absence of FUTEX_CLOCK_REALTIME? Am I missing something?

while that interpretation is independent of the CLOCK flag for all other op codes. In my opinion, we should treat the timeout value as relative for FUTEX_WAIT regardless of the CLOCK used.

I realize it's historical, but it is really weird that FUTEX_WAIT interprets the timeout (relative vs absolute) differently from all of the other operations.

That would require a change to the man page to eliminate the relative/absolute language in the FUTEX_CLOCK_REALTIME definition and explicit definitions of the interpretation for each op code (as Matthew explains above). Do we agree on that?

Yes. The man page changes are already in Git. My earlier reply contained the commit ref:
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/commit/?id=8064bfa5369c6856f606004d02e48ab275e05bed

Cheers, Michael
Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
On 06/23/2016 09:18 AM, Thomas Gleixner wrote:
On Wed, 22 Jun 2016, Darren Hart wrote:

However, I don't think the patch below is correct. The existing logic determines the type of timeout based on the futex_op when it should instead determine the type of timeout based on the FUTEX_CLOCK_REALTIME flag.

No.

My reading of the man page is that FUTEX_WAIT_BITSET abides by the timeout interpretation defined by the FUTEX_CLOCK_REALTIME attribute, so SYSCALL_DEFINE6 was misbehaving for FUTEX_WAIT|FUTEX_CLOCK_REALTIME (where the timeout should have been treated as absolute) as well as for FUTEX_WAIT_BITSET|FUTEX_CLOCK_MONOTONIC (where the timeout should have been treated as relative).

Consider the following:

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..fa2af29 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
+		if (!(cmd & FUTEX_CLOCK_REALTIME))
 			t = ktime_add_safe(ktime_get(), t);

That breaks LOCK_PI, REQUEUE_PI and FUTEX_WAIT_BITSET

The concern for me is whether the code is incorrect, or if the man page is incorrect. Does existing userspace code expect the FUTEX_WAIT_BITSET op to always use an absolute timeout, regardless of the CLOCK used?

FUTEX_WAIT_BITSET, LOCK_PI and REQUEUE_PI always expect absolute time in CLOCK_REALTIME independent of the CLOCK_REALTIME flag.

Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner <t...@linutronix.de> wrote:
On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:

And that universe would love to have your documentation of FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),

I give you almost the full treatment, but I leave REQUEUE_PI to Darren and FUTEX_WAKE_OP to Jakub. :)

[...]

FUTEX_CLOCK_REALTIME
    This option bit can be ored on the futex ops FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI. If set the kernel treats the user space supplied timeout as absolute time based on CLOCK_REALTIME. If not set the kernel treats the user space supplied timeout as relative time.

Unfortunately, I should have checked the code more carefully...

Looking more carefully at the code, I understand the situation is the following:

FUTEX_LOCK_PI
    Always uses CLOCK_REALTIME
    'timeout' is absolute

FUTEX_WAIT_REQUEUE_PI
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is absolute

FUTEX_WAIT_BITSET
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is absolute

FUTEX_WAIT
    Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag
    'timeout' is relative

Right? I've amended the man page to describe those details.

The flag was explicitly added to allow FUTEX_WAIT to hand in absolute time.

When you say that the "flag was added", which flag do you mean? Or, did you mean: "applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute time".

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..4bee915 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT && !(op & FUTEX_CLOCK_REALTIME))
 			t = ktime_add_safe(ktime_get(), t);
 		tp = &t;
 	}

So this patch is correct and if the man page is unclear about it then we need to fix that.

So, my fixes to the man page just now are at
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/commit/?id=8064bfa5369c6856f606004d02e48ab275e05bed

If Matthieu's patch is applied, obviously a further fix will be needed in the description of FUTEX_WAIT.

Cheers, Michael
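The two candidate conditions quoted in this message differ in which variable they test. Assuming, as in kernel/futex.c, that cmd is the op argument with the flag bits masked off (cmd = op & FUTEX_CMD_MASK), a flag can only be seen through op. The sketch below restates both conditions as user-space predicates; the function names are hypothetical and the constants come from <linux/futex.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/futex.h>

/*
 * Condition from Matthieu's patch: add the current time as a base only for
 * FUTEX_WAIT without FUTEX_CLOCK_REALTIME. The flag is tested on op.
 */
bool wait_patch_adds_base(int op)
{
    int cmd = op & FUTEX_CMD_MASK; /* command without flag bits */

    return cmd == FUTEX_WAIT && !(op & FUTEX_CLOCK_REALTIME);
}

/*
 * Condition from the alternative diff: the flag is tested on cmd, where it
 * has already been masked away, so this is true for every operation.
 */
bool alternative_adds_base(int op)
{
    int cmd = op & FUTEX_CMD_MASK;

    return !(cmd & FUTEX_CLOCK_REALTIME);
}
```

Under that assumption the alternative condition would add a base time to the already absolute timeouts of FUTEX_WAIT_BITSET, FUTEX_LOCK_PI and FUTEX_WAIT_REQUEUE_PI, which matches the objection that it breaks those operations.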
Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
Hi Darren,

On 06/23/2016 06:48 AM, Darren Hart wrote:
On Mon, Jun 20, 2016 at 04:26:52PM +0200, Matthieu CASTET wrote:

Hi,

the commit 337f13046ff03717a9e99675284a817527440a49 is saying that it changes the syscall to an equivalent of FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME with a bitset of FUTEX_BITSET_MATCH_ANY.

It seems wrong to me, because in case of FUTEX_WAIT, in "SYSCALL_DEFINE6(futex", we convert relative timeout to absolute timeout [1]. So FUTEX_CLOCK_REALTIME | FUTEX_WAIT is expecting a relative timeout when FUTEX_WAIT_BITSET takes an absolute timeout.

To make it work you have to use something like the (untested) attached patch.

+Eric Dumazet

Thanks for reporting Matthieu,

FUTEX_WAIT traditionally used a relative timeout with CLOCK_MONOTONIC while FUTEX_WAIT_BITSET could use either ??? based on the FUTEX_CLOCK_ flag used. The man page is not particularly clear on this:

http://man7.org/linux/man-pages/man2/futex.2.html

"The FUTEX_WAIT_BITSET operation also interprets the timeout argument differently from FUTEX_WAIT. See the discussion of FUTEX_CLOCK_REALTIME, above."

Michael Kerrisk: I think this language could be removed now that we support the FUTEX_CLOCK_REALTIME flag for both futex ops.

Done.

As for the intended behavior of the FUTEX_CLOCK_REALTIME flag:

FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
    This option bit can be employed only with the FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI, and FUTEX_WAIT (since Linux 4.5) operations.

(NOTE: FUTEX_WAIT was recently added after the patch in question here)

    If this option is set, the kernel treats timeout as an absolute time based on CLOCK_REALTIME. If this option is not set, the kernel treats timeout as a relative time, measured against the CLOCK_MONOTONIC clock.

This supports your argument Matthieu. The assumption of a relative timeout for FUTEX_WAIT in SYSCALL_DEFINE6 needs to be updated to account for FUTEX_WAIT now honoring the FUTEX_CLOCK_REALTIME flag, which treats the timeout as absolute.

However, I don't think the patch below is correct. The existing logic determines the type of timeout based on the futex_op when it should instead determine the type of timeout based on the FUTEX_CLOCK_REALTIME flag.

My reading of the man page is that FUTEX_WAIT_BITSET abides by the timeout interpretation defined by the FUTEX_CLOCK_REALTIME attribute, so SYSCALL_DEFINE6 was misbehaving for FUTEX_WAIT|FUTEX_CLOCK_REALTIME (where the timeout should have been treated as absolute) as well as for FUTEX_WAIT_BITSET|FUTEX_CLOCK_MONOTONIC (where the timeout should have been treated as relative).

Consider the following:

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..fa2af29 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
+		if (!(cmd & FUTEX_CLOCK_REALTIME))
 			t = ktime_add_safe(ktime_get(), t);
 		tp = &t;
 	}

The concern for me is whether the code is incorrect, or if the man page is incorrect. Does existing userspace code expect the FUTEX_WAIT_BITSET op to always use an absolute timeout, regardless of the CLOCK used?

So, there clearly seem to be some things broken in the man page. See the reply I sent to tglx.

Cheers, Michael

[1]
	if (cmd == FUTEX_WAIT)
		t = ktime_add_safe(ktime_get(), t);

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..4bee915 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT && !(op & FUTEX_CLOCK_REALTIME))
 			t = ktime_add_safe(ktime_get(), t);
 		tp = &t;
 	}
Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
On 06/23/2016 09:18 AM, Thomas Gleixner wrote: On Wed, 22 Jun 2016, Darren Hart wrote: However, I don't think the patch below is correct. The existing logic determines the type of timeout based on the futex_op when it should instead determine the type of timeout based on the FUTEX_CLOCK_REALTIME flag. No. My reading of the man page is that FUTEX_WAIT_BITSET abides by the timeout interpretation defined by the FUTEX_CLOCK_REALTIME attribute, so SYSCALL_DEFINE6 was misbehaving for FUTEX_WAIT|FUTEX_CLOCK_REALTIME (where the timeout should have been treated as absolute) as well as for FUTEX_WAIT_BITSET|FUTEX_CLOCK_MONOTONIC (where the timeout should have been treated as relative). Consider the following: diff --git a/kernel/futex.c b/kernel/futex.c index 33664f7..fa2af29 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, return -EINVAL; t = timespec_to_ktime(ts); - if (cmd == FUTEX_WAIT) + if (!(cmd & FUTEX_CLOCK_REALTIME)) t = ktime_add_safe(ktime_get(), t); That breaks LOCK_PI, REQUEUE_PI and FUTEX_WAIT_BITSET The concern for me is whether the code is incorrect, or if the man page is incorrect. Does existing userspace code expect the FUTEX_WAIT_BITSET op to always use an absolute timeout, regardless of the CLOCK used? FUTEX_WAIT_BITSET, LOCK_PI and REQUEUE_PI always expect absolute time in CLOCK_REALTIME independent of the CLOCK_REALTIME flag. Once upon a time, you told me the following: On 15 May 2014 at 16:14, Thomas Gleixner wrote: On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote: And that universe would love to have your documentation of FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-), I give you almost the full treatment, but I leave REQUEUE_PI to Darren and FUTEX_WAKE_OP to Jakub. :) [...] 
FUTEX_CLOCK_REALTIME This option bit can be ored on the futex ops FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI If set the kernel treats the user space supplied timeout as absolute time based on CLOCK_REALTIME. If not set the kernel treats the user space supplied timeout as relative time. Unfortunately, I should have checked the code more carefully... Looking more carefully at the code, I see understand the situation is the following: FUTEX_LOCK_PI Always uses CLOCK_REALTIME 'timeout' is absolute FUTEX_WAIT_REQUEUE_PI Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag 'timeout' is absolute FUTEX_WAIT_BITSET Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag 'timeout' is absolute FUTEX_WAIT Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is determined by presence or absence of FUTEX_CLOCK_REALTIME flag 'timeout' is relative Right? I've amended the man page to describe those details. The flag was explicitely added to allow FUTEX_WAIT to hand in absolute time. When you say that the "flag was added", which flag do you mean? Or, did you mean: "applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute time". diff --git a/kernel/futex.c b/kernel/futex.c index 33664f7..4bee915 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, return -EINVAL; t = timespec_to_ktime(ts); - if (cmd == FUTEX_WAIT) + if (cmd == FUTEX_WAIT && !(op & FUTEX_CLOCK_REALTIME)) t = ktime_add_safe(ktime_get(), t); tp = } So this patch is correct and if the man page is unclear about it then we need to fix that. So, my fixes to the man page just now are at http://git.kernel.org/cgit/docs/man-pages/man-pages.git/commit/?id=8064bfa5369c6856f606004d02e48ab275e05bed If Matthieu's patch is applied, obviously a further fix will be needed needed in the description of FUTEX_WAIT. 
Cheers, Michael
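The per-operation summary in the message above can be written down as a small lookup, which may make the four cases easier to compare. This is only a sketch of that summary, not kernel code: the `SK_`-prefixed names are hypothetical stand-ins, and their numeric values are what I believe `<linux/futex.h>` defines, repeated locally so the sketch builds without kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical local stand-ins for the futex UAPI constants (values assumed
 * to mirror <linux/futex.h>). */
enum {
    SK_FUTEX_WAIT            = 0,
    SK_FUTEX_LOCK_PI         = 6,
    SK_FUTEX_WAIT_BITSET     = 9,
    SK_FUTEX_WAIT_REQUEUE_PI = 11,
    SK_FUTEX_PRIVATE_FLAG    = 128,
    SK_FUTEX_CLOCK_REALTIME  = 256,
    SK_FUTEX_CMD_MASK        = ~(SK_FUTEX_PRIVATE_FLAG | SK_FUTEX_CLOCK_REALTIME)
};

enum sk_clock { SK_CLOCK_MONOTONIC, SK_CLOCK_REALTIME };

struct sk_timeout_rule {
    enum sk_clock clk;  /* clock the timeout is measured against */
    bool absolute;      /* true: absolute deadline; false: relative interval */
};

/* Encode the summary from the thread: which clock applies, and whether the
 * user-supplied timeout is absolute or relative, for a given futex op. */
struct sk_timeout_rule sk_classify_timeout(int op)
{
    int cmd = op & SK_FUTEX_CMD_MASK;
    bool realtime_flag = (op & SK_FUTEX_CLOCK_REALTIME) != 0;
    struct sk_timeout_rule rule;

    /* Only the four operations listed in the summary are modelled. */
    assert(cmd == SK_FUTEX_WAIT || cmd == SK_FUTEX_LOCK_PI ||
           cmd == SK_FUTEX_WAIT_BITSET || cmd == SK_FUTEX_WAIT_REQUEUE_PI);

    if (cmd == SK_FUTEX_LOCK_PI) {
        /* Always CLOCK_REALTIME, always absolute; the flag is irrelevant. */
        rule.clk = SK_CLOCK_REALTIME;
        rule.absolute = true;
        return rule;
    }

    /* For the other three, the flag selects the clock... */
    rule.clk = realtime_flag ? SK_CLOCK_REALTIME : SK_CLOCK_MONOTONIC;
    /* ...and only FUTEX_WAIT treats the timeout as relative. */
    rule.absolute = (cmd != SK_FUTEX_WAIT);
    return rule;
}
```

For example, `sk_classify_timeout(SK_FUTEX_WAIT | SK_FUTEX_CLOCK_REALTIME)` reports CLOCK_REALTIME with a relative timeout, which is exactly the combination the rest of the thread argues about.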
Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
Hi Darren,

On 06/23/2016 06:48 AM, Darren Hart wrote: On Mon, Jun 20, 2016 at 04:26:52PM +0200, Matthieu CASTET wrote:

Hi, the commit 337f13046ff03717a9e99675284a817527440a49 says that it changes the syscall to an equivalent of FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME with a bitset of FUTEX_BITSET_MATCH_ANY. It seems wrong to me, because in the case of FUTEX_WAIT, in "SYSCALL_DEFINE6(futex", we convert the relative timeout to an absolute timeout [1]. So FUTEX_CLOCK_REALTIME | FUTEX_WAIT expects a relative timeout, whereas FUTEX_WAIT_BITSET takes an absolute timeout. To make it work you have to use something like the (untested) attached patch.

+Eric Dumazet

Thanks for reporting Matthieu, FUTEX_WAIT traditionally used a relative timeout with CLOCK_MONOTONIC while FUTEX_WAIT_BITSET could use either ??? based on the FUTEX_CLOCK_ flag used. The man page is not particularly clear on this: http://man7.org/linux/man-pages/man2/futex.2.html

" The FUTEX_WAIT_BITSET operation also interprets the timeout argument differently from FUTEX_WAIT. See the discussion of FUTEX_CLOCK_REALTIME, above. "

Michael Kerrisk: I think this language could be removed now that we support the FUTEX_CLOCK_REALTIME flag for both futex ops.

Done.

As for the intended behavior of the FUTEX_CLOCK_REALTIME flag:

FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
    This option bit can be employed only with the FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI, and FUTEX_WAIT (since Linux 4.5) operations. (NOTE: FUTEX_WAIT was recently added after the patch in question here) If this option is set, the kernel treats timeout as an absolute time based on CLOCK_REALTIME. If this option is not set, the kernel treats timeout as a relative time, measured against the CLOCK_MONOTONIC clock.

This supports your argument Matthieu. The assumption of a relative timeout for FUTEX_WAIT in SYSCALL_DEFINE6 needs to be updated to account for FUTEX_WAIT now honoring the FUTEX_CLOCK_REALTIME flag, which treats the timeout as absolute.
However, I don't think the patch below is correct. The existing logic determines the type of timeout based on the futex_op when it should instead determine the type of timeout based on the FUTEX_CLOCK_REALTIME flag. My reading of the man page is that FUTEX_WAIT_BITSET abides by the timeout interpretation defined by the FUTEX_CLOCK_REALTIME attribute, so SYSCALL_DEFINE6 was misbehaving for FUTEX_WAIT|FUTEX_CLOCK_REALTIME (where the timeout should have been treated as absolute) as well as for FUTEX_WAIT_BITSET|FUTEX_CLOCK_MONOTONIC (where the timeout should have been treated as relative). Consider the following:

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..fa2af29 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
             return -EINVAL;

         t = timespec_to_ktime(ts);
-        if (cmd == FUTEX_WAIT)
+        if (!(cmd & FUTEX_CLOCK_REALTIME))
             t = ktime_add_safe(ktime_get(), t);
         tp = &t;
     }

The concern for me is whether the code is incorrect, or if the man page is incorrect. Does existing userspace code expect the FUTEX_WAIT_BITSET op to always use an absolute timeout, regardless of the CLOCK used?

So, there clearly seem to be some things broken in the man page. See the reply I sent to tglx.

Cheers, Michael

[1] if (cmd == FUTEX_WAIT) t = ktime_add_safe(ktime_get(), t);

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..4bee915 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
             return -EINVAL;

         t = timespec_to_ktime(ts);
-        if (cmd == FUTEX_WAIT)
+        if (cmd == FUTEX_WAIT && !(op & FUTEX_CLOCK_REALTIME))
             t = ktime_add_safe(ktime_get(), t);
         tp = &t;
     }
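To compare the three conditions discussed in this message side by side, here is a user-space sketch of the question each one answers: "does the syscall entry add the current time to the user's timeout, that is, treat it as relative?" The function and constant names are hypothetical, the constant values are assumed to mirror `<linux/futex.h>`, and the second predicate models the stated intent of the flag-based proposal (test the flag on the full op word) rather than the literal diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical local stand-ins; values assumed to match <linux/futex.h>. */
enum {
    SK_FUTEX_WAIT            = 0,
    SK_FUTEX_LOCK_PI         = 6,
    SK_FUTEX_WAIT_BITSET     = 9,
    SK_FUTEX_WAIT_REQUEUE_PI = 11,
    SK_FUTEX_PRIVATE_FLAG    = 128,
    SK_FUTEX_CLOCK_REALTIME  = 256,
    SK_FUTEX_CMD_MASK        = ~(SK_FUTEX_PRIVATE_FLAG | SK_FUTEX_CLOCK_REALTIME)
};

/* Existing logic quoted as [1]: only FUTEX_WAIT is treated as relative. */
bool sk_adds_now_existing(int op)
{
    return (op & SK_FUTEX_CMD_MASK) == SK_FUTEX_WAIT;
}

/* Flag-based proposal (intent): relative whenever the flag is absent. */
bool sk_adds_now_flag_based(int op)
{
    return !(op & SK_FUTEX_CLOCK_REALTIME);
}

/* Matthieu's patch: relative only for FUTEX_WAIT without the flag. */
bool sk_adds_now_matthieu(int op)
{
    return (op & SK_FUTEX_CMD_MASK) == SK_FUTEX_WAIT &&
           !(op & SK_FUTEX_CLOCK_REALTIME);
}

/* Spot checks for the cases raised in the thread. */
void sk_check_thread_cases(void)
{
    /* FUTEX_LOCK_PI and FUTEX_WAIT_BITSET take absolute times; only the
     * flag-based rule would start adding "now" to them. */
    assert(!sk_adds_now_existing(SK_FUTEX_LOCK_PI));
    assert(sk_adds_now_flag_based(SK_FUTEX_LOCK_PI));
    assert(!sk_adds_now_matthieu(SK_FUTEX_LOCK_PI));
    assert(sk_adds_now_flag_based(SK_FUTEX_WAIT_BITSET));

    /* FUTEX_WAIT | FUTEX_CLOCK_REALTIME: relative today, absolute with
     * Matthieu's patch. */
    assert(sk_adds_now_existing(SK_FUTEX_WAIT | SK_FUTEX_CLOCK_REALTIME));
    assert(!sk_adds_now_matthieu(SK_FUTEX_WAIT | SK_FUTEX_CLOCK_REALTIME));
}
```

Under this model the flag-based rule changes behavior for every operation that is called without the flag, while Matthieu's condition changes only the FUTEX_WAIT plus FUTEX_CLOCK_REALTIME case.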
Re: Documenting ptrace access mode checking
Hi Jann,

Thanks for your further review. Follow-up of one point below.

On 06/23/2016 12:44 AM, Jann Horn wrote: On Wed, Jun 22, 2016 at 09:21:29PM +0200, Michael Kerrisk (man-pages) wrote: On 06/21/2016 10:55 PM, Jann Horn wrote: On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages) wrote:

[...]

The algorithm employed for ptrace access mode checking determines whether the calling process is allowed to perform the corresponding action on the target process, as follows:

1. If the calling thread and the target thread are in the same thread group, access is always allowed.

2. If the access mode specifies PTRACE_MODE_FSCREDS, then for the check in the next step, employ the caller's filesystem user ID and group ID (see credentials(7)); otherwise (the access mode specifies PTRACE_MODE_REALCREDS, so) use the caller's real user ID and group ID.

Might want to add a "for historical reasons" or so here.

Can you be a little more precise about "here", and maybe tell me why you think it helps?

I'm not sure, but it might be a good idea to add something like this at the end of 2.: "(Most other APIs that check one of the caller's UIDs use the effective one. This API uses the real UID instead for historical reasons.)" In my opinion, it is inconsistent to use the real UID/GID here, the effective one would be more appropriate. But since the existing code uses the real UID/GID and that's not a security issue for existing users of the ptrace API, this wasn't changed when I added the REALCREDS/FSCREDS distinction. I think that for a reader, it might help to point out that in most cases, when a process is the subject in an access check, its effective UID/GID are used, and this is (together with kill()) an exception to that rule. But you're the expert on writing documentation, if you think that that's too much detail / confusing here, it probably is.

Okay -- got it now, I think. I made this text:

2. If the access mode specifies PTRACE_MODE_FSCREDS, then, for the check in the next step, employ the caller's filesystem UID and GID. (As noted in credentials(7), the filesystem UID and GID almost always have the same values as the corresponding effective IDs.) Otherwise, the access mode specifies PTRACE_MODE_REALCREDS, so use the caller's real UID and GID for the checks in the next step. (Most APIs that check the caller's UID and GID use the effective IDs. For historical reasons, the PTRACE_MODE_REALCREDS check uses the real IDs instead.)

[...]

Cheers, Michael
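The revised step 2 amounts to picking one pair of caller IDs before the comparison in step 3. A minimal user-space sketch of that selection follows; the struct layouts and function name are hypothetical, and the two flag values are what I believe the kernel's `include/linux/ptrace.h` uses, repeated here as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel-internal modifier flags. */
enum {
    SK_PTRACE_MODE_FSCREDS   = 0x08,
    SK_PTRACE_MODE_REALCREDS = 0x10
};

/* Illustrative view of the caller's credentials. */
struct sk_caller_ids {
    unsigned int ruid, rgid;    /* real IDs */
    unsigned int fsuid, fsgid;  /* filesystem IDs */
};

struct sk_id_pair {
    unsigned int uid, gid;
};

/* Step 2: choose which caller IDs the next step compares against the
 * target. FSCREDS selects the filesystem IDs; otherwise the real IDs are
 * used (not the effective ones, for the historical reasons discussed). */
struct sk_id_pair sk_ids_for_check(const struct sk_caller_ids *caller,
                                   unsigned int mode)
{
    struct sk_id_pair ids;

    assert(caller != NULL);
    if (mode & SK_PTRACE_MODE_FSCREDS) {
        ids.uid = caller->fsuid;
        ids.gid = caller->fsgid;
    } else {
        ids.uid = caller->ruid;
        ids.gid = caller->rgid;
    }
    return ids;
}
```

The distinction only matters for a caller whose real and filesystem IDs differ, for example a set-user-ID program.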
Re: Documenting ptrace access mode checking
Hi Oleg,

On 06/22/2016 11:51 PM, Oleg Nesterov wrote: On 06/21, Eric W. Biederman wrote:

Adding Oleg just because he seems to do most of the ptrace related maintenance these days.

so I have to admit that I never even tried to actually understand ptrace_may_access ;)

We certainly need something that gives a high level view so people reading the man page can know what to expect. If you get down into the weeds we run the danger of people beginning to think they can depend upon bugs in the implementation.

Personally I agree. I think "man ptrace" shouldn't tell too much about kernel internals.

See my other replies on this topic. Somehow, we need a way of describing the behavior that user-space sees. I think it's inevitable that that means talking about what's going on "under the hood".

Regarding Eric's point that "we run the danger of people beginning to think they can depend upon bugs in the implementation": when it comes to breaking the ABI, the presence or absence of documentation doesn't save us on that point (Linus has a few times made his position wrt documentation clear).

Cheers, Michael
Re: Documenting ptrace access mode checking
On 06/22/2016 11:11 PM, Kees Cook wrote: On Wed, Jun 22, 2016 at 12:21 PM, Michael Kerrisk (man-pages) <mtk.manpa...@gmail.com> wrote: On 06/21/2016 10:55 PM, Jann Horn wrote: On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages) wrote:

5. The kernel LSM security_ptrace_access_check() interface is invoked to see if ptrace access is permitted. The results depend on the LSM. The implementation of this interface in the default LSM performs the following steps:

For people who are unaware of how the LSM API works, it might be good to clarify that the commoncap LSM is *always* invoked; otherwise, it might give the impression that using another LSM would replace it.

As we can see, I am one of those who are unaware of how the LSM API works :-/.

(Also, are there other documents that refer to it as "default LSM"? I think that that term is slightly confusing.)

No, that's a terminological confusion of my own making. Fixed now. I changed this text to:

Various parts of the kernel-user-space API (not just ptrace(2) operations) require so-called "ptrace access mode permissions" which are gated by any enabled Linux Security Module (LSM) -- for example, SELinux, Yama, or Smack -- and by the commoncap LSM (which is always invoked). Prior to Linux 2.6.27, all such checks were of a single type. Since Linux 2.6.27, two access mode levels are distinguished:

BTW, can you point me at the piece(s) of kernel code that show that "commoncap" is always invoked in addition to any other LSM that has been installed?

It's not entirely obvious, but the bottom of security/commoncap.c shows:

Thanks Kees!

Cheers, Michael

#ifdef CONFIG_SECURITY
struct security_hook_list capability_hooks[] = {
    LSM_HOOK_INIT(capable, cap_capable),
    ...
};

void __init capability_add_hooks(void)
{
    security_add_hooks(capability_hooks, ARRAY_SIZE(capability_hooks));
}
#endif

And security/security.c shows the initialization order of the LSMs:

int __init security_init(void)
{
    pr_info("Security Framework initialized\n");

    /*
     * Load minor LSMs, with the capability module always first.
     */
    capability_add_hooks();
    yama_add_hooks();
    loadpin_add_hooks();

    /*
     * Load all the remaining security modules.
     */
    do_security_initcalls();

    return 0;
}

-Kees
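The point of the kernel excerpts above, that the capability hooks are registered first and other LSMs are added alongside them rather than replacing them, can be illustrated with a toy user-space hook list. Everything here is hypothetical (names, return codes, request types), and the "stop at the first non-zero result" rule is an assumption of this model, not something the quoted code shows.

```c
#include <assert.h>
#include <stddef.h>

enum { SK_MAX_HOOKS = 8 };

/* Hypothetical request kinds and result codes for the toy model. */
enum sk_request { SK_REQ_HARMLESS, SK_REQ_NEEDS_CAPABILITY, SK_REQ_AGAINST_POLICY };
enum { SK_OK = 0, SK_DENIED_BY_CAPABILITY = -1, SK_DENIED_BY_OTHER_LSM = -2 };

typedef int (*sk_hook_fn)(enum sk_request req);

static sk_hook_fn sk_hooks[SK_MAX_HOOKS];
static size_t sk_hook_count;

/* Append a hook; registration order is the call order. */
void sk_add_hook(sk_hook_fn fn)
{
    assert(sk_hook_count < SK_MAX_HOOKS);
    sk_hooks[sk_hook_count++] = fn;
}

/* Stand-in for the capability module's check. */
int sk_capability_hook(enum sk_request req)
{
    return req == SK_REQ_NEEDS_CAPABILITY ? SK_DENIED_BY_CAPABILITY : SK_OK;
}

/* Stand-in for any additional LSM's check. */
int sk_other_lsm_hook(enum sk_request req)
{
    return req == SK_REQ_AGAINST_POLICY ? SK_DENIED_BY_OTHER_LSM : SK_OK;
}

/* Mirror the quoted ordering: capability first, then the others. */
void sk_security_init(void)
{
    sk_hook_count = 0;
    sk_add_hook(sk_capability_hook);
    sk_add_hook(sk_other_lsm_hook);
}

/* Call every registered hook in order; the first denial wins. */
int sk_call_hooks(enum sk_request req)
{
    for (size_t i = 0; i < sk_hook_count; i++) {
        int rc = sk_hooks[i](req);
        if (rc != SK_OK)
            return rc;
    }
    return SK_OK;
}
```

After `sk_security_init()`, a request the capability stand-in rejects is denied even though another hook is installed, which is the behavior Jann asked the man page to make explicit.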
Re: Documenting ptrace access mode checking
Hi Jann, On 06/21/2016 10:55 PM, Jann Horn wrote: On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages) wrote: Hi Jann, Stephen, et al. Jann, since you recently committed a patch in this area, and Stephen, since you committed 006ebb40d3d much further back in time, I wonder if you might help me by reviewing the text below that I propose to add to the ptrace(2) man page, in order to document "ptrace access mode checking" that is performed in various parts of the kernel-user-space interface. Of course, I welcome input from anyone else as well. Here's the new ptrace(2) text. Any comments, technical or terminological fixes, other improvements, etc. are welcome. As others have said, I'm surprised about seeing documentation about kernel-internal constants in manpages - but I think it might be a good thing to have there, given that people who look at ptrace(2) are likely to be interested in low-level details. I agree that it is a little surprising to add kernel-internal constants in a man page. (There are precedents, but they are few.) But see my reply to Kees. It's more than just explaining low level details: there are various kinds of user-space behavior differences (real vs filesystem credentials; permitted vs effective capabilities) produced by the ptrace_may_access() checks, and those behaviors need to be described and *somehow* labeled for cross-referencing from other man pages. [[ Ptrace access mode checking Various parts of the kernel-user-space API (not just ptrace(2) operations), require so-called "ptrace access mode permissions" which are gated by Linux Security Modules (LSMs) such as SELinux, Yama, Smack, or the default LSM. Prior to Linux 2.6.27, all such checks were of a single type. 
Since Linux 2.6.27, two access mode levels are distinguished:

PTRACE_MODE_READ
    For "read" operations or other operations that are less dangerous, such as: get_robust_list(2); kcmp(2); reading /proc/[pid]/auxv, /proc/[pid]/environ, or /proc/[pid]/stat; or readlink(2) of a /proc/[pid]/ns/* file.

PTRACE_MODE_ATTACH
    For "write" operations, or other operations that are more dangerous, such as: ptrace attaching (PTRACE_ATTACH) to another process or calling process_vm_writev(2). (PTRACE_MODE_ATTACH was effectively the default before Linux 2.6.27.)

Since Linux 4.5, the above access mode checks may be combined

s/may/must/; otherwise __ptrace_may_access() will yell about the kernel code being broken and deny access.

Good point. I changed "may" to "are". ("must" is not quite right to my "user-space" ear; it might be misread as implying that the user-space developer must do something.)

(ORed) with one of the following modifiers:

PTRACE_MODE_FSCREDS
    Use the caller's filesystem UID and GID (see credentials(7)) or effective capabilities for LSM checks.

PTRACE_MODE_REALCREDS
    Use the caller's real UID and GID or permitted capabilities for LSM checks. This was effectively the default before Linux 4.5.

Because combining one of the credential modifiers with one of the aforementioned access modes is typical, some macros are defined in the kernel sources for the combinations:

PTRACE_MODE_READ_FSCREDS
    Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.
PTRACE_MODE_READ_REALCREDS
    Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.
PTRACE_MODE_ATTACH_FSCREDS
    Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.
PTRACE_MODE_ATTACH_REALCREDS
    Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

One further modifier can be ORed with the access mode:

PTRACE_MODE_NOAUDIT (since Linux 3.3)
    Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]

The algorithm employed for ptrace access mode checking determines whether the calling process is allowed to perform the corresponding action on the target process, as follows:

1. If the calling thread and the target thread are in the same thread group, access is always allowed.

2. If the access mode specifies PTRACE_MODE_FSCREDS, then for the check in the next step, employ the caller's filesystem user ID and group ID (see credentials(7)); otherwise (the access mode specifies PTRACE_MODE_REALCREDS, so) use the caller's real user ID and group ID.

Might want to add a "for historical reasons" or so here.
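Jann's "s/may/must/" remark says that a mode word has to carry exactly one of the two credential modifiers. The sketch below writes the flags and the four combination macros from the draft as an enum and adds that validity test. The `SK_` names are hypothetical; the numeric values are what I believe the kernel uses internally and should be read as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel-internal access mode flags. */
enum {
    SK_PTRACE_MODE_READ      = 0x01,
    SK_PTRACE_MODE_ATTACH    = 0x02,
    SK_PTRACE_MODE_NOAUDIT   = 0x04,
    SK_PTRACE_MODE_FSCREDS   = 0x08,
    SK_PTRACE_MODE_REALCREDS = 0x10,

    /* The four combinations named in the draft text. */
    SK_PTRACE_MODE_READ_FSCREDS     = SK_PTRACE_MODE_READ | SK_PTRACE_MODE_FSCREDS,
    SK_PTRACE_MODE_READ_REALCREDS   = SK_PTRACE_MODE_READ | SK_PTRACE_MODE_REALCREDS,
    SK_PTRACE_MODE_ATTACH_FSCREDS   = SK_PTRACE_MODE_ATTACH | SK_PTRACE_MODE_FSCREDS,
    SK_PTRACE_MODE_ATTACH_REALCREDS = SK_PTRACE_MODE_ATTACH | SK_PTRACE_MODE_REALCREDS
};

/* True if exactly one of FSCREDS / REALCREDS is set. Per the review
 * comment, a mode with neither or both is treated as a kernel bug and
 * access is denied. */
bool sk_mode_has_one_cred_modifier(unsigned int mode)
{
    bool fs = (mode & SK_PTRACE_MODE_FSCREDS) != 0;
    bool real = (mode & SK_PTRACE_MODE_REALCREDS) != 0;

    return fs != real;
}

/* Spot checks: bare access modes are invalid, the combinations are valid. */
void sk_check_modes(void)
{
    assert(!sk_mode_has_one_cred_modifier(SK_PTRACE_MODE_READ));
    assert(sk_mode_has_one_cred_modifier(SK_PTRACE_MODE_READ_FSCREDS));
    assert(sk_mode_has_one_cred_modifier(SK_PTRACE_MODE_ATTACH_REALCREDS |
                                         SK_PTRACE_MODE_NOAUDIT));
}
```

This is also why "are combined" reads better than "may be combined" in the man page text: the modifier is not optional from the kernel's point of view.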
Re: Documenting ptrace access mode checking
Hi Kees,

On 06/21/2016 10:29 PM, Kees Cook wrote: On Tue, Jun 21, 2016 at 12:55 PM, Eric W. Biederman wrote:

Adding Oleg just because he seems to do most of the ptrace related maintenance these days.

"Michael Kerrisk (man-pages)" writes:

Hi Jann, Stephen, et al.

Jann, since you recently committed a patch in this area, and Stephen, since you committed 006ebb40d3d much further back in time, I wonder if you might help me by reviewing the text below that I propose to add to the ptrace(2) man page, in order to document "ptrace access mode checking" that is performed in various parts of the kernel-user-space interface. Of course, I welcome input from anyone else as well.

Your text matches my understanding of this code. :)

Thanks for reviewing the text!

Here's the new ptrace(2) text. Any comments, technical or terminological fixes, other improvements, etc. are welcome.

[[
Ptrace access mode checking

Various parts of the kernel-user-space API (not just ptrace(2) operations) require so-called "ptrace access mode permissions" which are gated by Linux Security Modules (LSMs) such as SELinux, Yama, Smack, or the default LSM. Prior to Linux 2.6.27, all such checks were of a single type. Since Linux 2.6.27, two access mode levels are distinguished:

PTRACE_MODE_READ
    For "read" operations or other operations that are less dangerous, such as: get_robust_list(2); kcmp(2); reading /proc/[pid]/auxv, /proc/[pid]/environ, or /proc/[pid]/stat; or readlink(2) of a /proc/[pid]/ns/* file.

PTRACE_MODE_ATTACH
    For "write" operations, or other operations that are more dangerous, such as: ptrace attaching (PTRACE_ATTACH) to another process or calling process_vm_writev(2). (PTRACE_MODE_ATTACH was effectively the default before Linux 2.6.27.)

Since Linux 4.5, the above access mode checks may be combined (ORed) with one of the following modifiers:

PTRACE_MODE_FSCREDS
    Use the caller's filesystem UID and GID (see credentials(7)) or effective capabilities for LSM checks.

PTRACE_MODE_REALCREDS
    Use the caller's real UID and GID or permitted capabilities for LSM checks. This was effectively the default before Linux 4.5.

Because combining one of the credential modifiers with one of the aforementioned access modes is typical, some macros are defined in the kernel sources for the combinations:

PTRACE_MODE_READ_FSCREDS
    Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.
PTRACE_MODE_READ_REALCREDS
    Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.
PTRACE_MODE_ATTACH_FSCREDS
    Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.
PTRACE_MODE_ATTACH_REALCREDS
    Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

One further modifier can be ORed with the access mode:

PTRACE_MODE_NOAUDIT (since Linux 3.3)
    Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]

AKA don't let the audit subsystem know. Which tends to generate audit records when capable is called.

The algorithm employed for ptrace access mode checking determines whether the calling process is allowed to perform the corresponding action on the target process, as follows:

1. If the calling thread and the target thread are in the same thread group, access is always allowed.

This test only exists because the LSMs historically and I suspect continue to be broken and deny a process the ability to ptrace itself.

Well, it's not that the LSMs are broken, it's that self-inspection is a short-circuited "allow". The LSMs aren't involved.

2. If the access mode specifies PTRACE_MODE_FSCREDS, then for the check in the next step, employ the caller's filesystem user ID and group ID (see credentials(7)); otherwise (the access mode specifies PTRACE_MODE_REALCREDS, so) use the caller's real user ID and group ID.

3. Deny access if neither of the following is true:

   · The real, effective, and saved-set user IDs of the target match the caller's user ID, and the real, effective, and saved-set group IDs of the target match the caller's group ID.

   · The caller has the CAP_SYS_PTRACE capability.

4. Deny access if the target process "dumpable" attribute has a value other than 1 (SUID_DUMP_USER; see the discussion of PR_SET_DU
Re: Documenting ptrace access mode checking
Hi Eric,

On 06/21/2016 09:55 PM, Eric W. Biederman wrote:

Adding Oleg just because he seems to do most of the ptrace related
maintenance these days.

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:

Hi Jann, Stephen, et al.

Jann, since you recently committed a patch in this area, and Stephen,
since you committed 006ebb40d3d much further back in time, I wonder if
you might help me by reviewing the text below that I propose to add to
the ptrace(2) man page, in order to document "ptrace access mode
checking" that is performed in various parts of the kernel-user-space
interface. Of course, I welcome input from anyone else as well.

Here's the new ptrace(2) text. Any comments, technical or
terminological fixes, other improvements, etc. are welcome.

[[
   Ptrace access mode checking
       Various parts of the kernel-user-space API (not just ptrace(2)
       operations) require so-called "ptrace access mode permissions"
       which are gated by Linux Security Modules (LSMs) such as
       SELinux, Yama, Smack, or the default LSM. Prior to Linux
       2.6.27, all such checks were of a single type. Since Linux
       2.6.27, two access mode levels are distinguished:

       PTRACE_MODE_READ
              For "read" operations or other operations that are less
              dangerous, such as: get_robust_list(2); kcmp(2); reading
              /proc/[pid]/auxv, /proc/[pid]/environ, or
              /proc/[pid]/stat; or readlink(2) of a /proc/[pid]/ns/*
              file.

       PTRACE_MODE_ATTACH
              For "write" operations, or other operations that are
              more dangerous, such as: ptrace attaching
              (PTRACE_ATTACH) to another process or calling
              process_vm_writev(2). (PTRACE_MODE_ATTACH was
              effectively the default before Linux 2.6.27.)

       Since Linux 4.5, the above access mode checks may be combined
       (ORed) with one of the following modifiers:

       PTRACE_MODE_FSCREDS
              Use the caller's filesystem UID and GID (see
              credentials(7)) or effective capabilities for LSM
              checks.

       PTRACE_MODE_REALCREDS
              Use the caller's real UID and GID or permitted
              capabilities for LSM checks. This was effectively the
              default before Linux 4.5.

       Because combining one of the credential modifiers with one of
       the aforementioned access modes is typical, some macros are
       defined in the kernel sources for the combinations:

       PTRACE_MODE_READ_FSCREDS
              Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

       PTRACE_MODE_READ_REALCREDS
              Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

       PTRACE_MODE_ATTACH_FSCREDS
              Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

       PTRACE_MODE_ATTACH_REALCREDS
              Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

       One further modifier can be ORed with the access mode:

       PTRACE_MODE_NOAUDIT (since Linux 3.3)
              Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]

AKA don't let the audit subsystem know. Which tends to generate audit
records capable is called.

       The algorithm employed for ptrace access mode checking
       determines whether the calling process is allowed to perform
       the corresponding action on the target process, as follows:

       1.  If the calling thread and the target thread are in the same
           thread group, access is always allowed.

This test only exists because the LSMs historically and I suspect
continue to be broken and deny a process the ability to ptrace itself.

       2.  If the access mode specifies PTRACE_MODE_FSCREDS, then for
           the check in the next step, employ the caller's filesystem
           user ID and group ID (see credentials(7)); otherwise (the
           access mode specifies PTRACE_MODE_REALCREDS, so) use the
           caller's real user ID and group ID.

       3.  Deny access if neither of the following is true:

           · The real, effective, and saved-set user IDs of the target
             match the caller's user ID, and the real, effective, and
             saved-set group IDs of the target match the caller's
             group ID.

           · The caller has the CAP_SYS_PTRACE capability.

       4.  Deny access if the target process "dumpable" attribute has
           a value other than 1 (SUID_DUMP_USER; see the discussion of
           PR_SET_DUMPABLE in prctl(2)), and the caller does not have
           the CAP_SYS_PTRACE capability in the user namespace of the
           target process.

       5.  The kernel LSM security_ptrace_access_check() interface is
           invoked to see if ptrace access is perm
Documenting ptrace access mode checking
           b) Deny access if neither of the following is true:

              · The caller's capabilities are a proper superset of the
                target process's permitted capabilities.

              · The caller has the CAP_SYS_PTRACE capability in the
                target process's user namespace.

              Note that the default LSM does not distinguish between
              PTRACE_MODE_READ and PTRACE_MODE_ATTACH.

       6.  If access has not been denied by any of the preceding
           steps, then access is allowed.
]]

There are accompanying changes to various pages that refer to the new
text in ptrace(2), so that, for example, kcmp(2) adds:

       Permission to employ kcmp() is governed by ptrace access mode
       PTRACE_MODE_READ_REALCREDS checks against both pid1 and pid2;
       see ptrace(2).

and proc.5 has additions such as:

       /proc/[pid]/auxv (since 2.6.0-test7)
              ...
              Permission to access this file is governed by a ptrace
              access mode PTRACE_MODE_READ_FSCREDS check; see
              ptrace(2).

       /proc/[pid]/cwd
              ...
              Permission to dereference or read (readlink(2)) this
              symbolic link is governed by a ptrace access mode
              PTRACE_MODE_READ_FSCREDS check; see ptrace(2).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
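As a reading aid, the order of the checks in the draft text can be modelled in a few lines of Python. This is only a sketch of the documented steps, not kernel code. The Proc class, its field names, and the lsm_allows hook are hypothetical; the numeric flag values are assumptions; capabilities are plain sets of names and user namespaces are not modelled. Step 5b here uses a non-strict superset test (an assumption, so that two unprivileged processes with equal capability sets pass), where the draft wording says "proper superset".

```python
from dataclasses import dataclass
from typing import Optional

# Flag values are assumptions for illustration; the draft only names them.
PTRACE_MODE_READ = 0x01
PTRACE_MODE_ATTACH = 0x02
PTRACE_MODE_FSCREDS = 0x08
PTRACE_MODE_REALCREDS = 0x10

# Combinations exactly as the draft text defines them.
PTRACE_MODE_READ_FSCREDS = PTRACE_MODE_READ | PTRACE_MODE_FSCREDS
PTRACE_MODE_READ_REALCREDS = PTRACE_MODE_READ | PTRACE_MODE_REALCREDS
PTRACE_MODE_ATTACH_FSCREDS = PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS
PTRACE_MODE_ATTACH_REALCREDS = PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS

SUID_DUMP_USER = 1


@dataclass
class Proc:
    """Hypothetical, simplified stand-in for a process and its credentials."""
    tgid: int
    ruid: int
    rgid: int
    euid: Optional[int] = None
    suid: Optional[int] = None
    fsuid: Optional[int] = None
    egid: Optional[int] = None
    sgid: Optional[int] = None
    fsgid: Optional[int] = None
    permitted: frozenset = frozenset()
    effective: frozenset = frozenset()
    dumpable: int = SUID_DUMP_USER

    def __post_init__(self):
        # Unset IDs default to the real UID/GID.
        for name in ("euid", "suid", "fsuid"):
            if getattr(self, name) is None:
                setattr(self, name, self.ruid)
        for name in ("egid", "sgid", "fsgid"):
            if getattr(self, name) is None:
                setattr(self, name, self.rgid)


def ptrace_may_access(caller, target, mode, lsm_allows=lambda c, t, m: True):
    # Step 1: same thread group is always allowed.
    if caller.tgid == target.tgid:
        return True
    # Step 2: pick the caller's credentials according to the modifier.
    if mode & PTRACE_MODE_FSCREDS:
        uid, gid, caps = caller.fsuid, caller.fsgid, caller.effective
    else:
        uid, gid, caps = caller.ruid, caller.rgid, caller.permitted
    has_cap = "CAP_SYS_PTRACE" in caps
    # Step 3: all three target UIDs and GIDs must match, or the caller is capable.
    ids_match = (target.ruid == target.euid == target.suid == uid
                 and target.rgid == target.egid == target.sgid == gid)
    if not (ids_match or has_cap):
        return False
    # Step 4: a non-dumpable target needs CAP_SYS_PTRACE.
    if target.dumpable != SUID_DUMP_USER and not has_cap:
        return False
    # Step 5: other LSMs (hypothetical hook), then the default LSM's check (b).
    if not lsm_allows(caller, target, mode):
        return False
    if not (caps >= target.permitted or has_cap):
        return False
    # Step 6: nothing denied access.
    return True
```

With this model, an unprivileged caller is allowed to inspect another process with the same IDs, and is refused once the target is non-dumpable or holds permitted capabilities the caller lacks.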
Re: [PATCH 0/9] [v2] System Calls for Memory Protection Keys
On 06/07/2016 10:47 PM, Dave Hansen wrote:
> Are there any concerns with merging these into the x86 tree so
> that they go upstream for 4.8?

I believe we still don't have up-to-date man pages, right?
Best from my POV to send them out in parallel with the implementation.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH 5/8] x86, pkeys: allocation/free syscalls
On 06/03/2016 12:28 PM, Dave Hansen wrote:
> On 06/02/2016 05:26 PM, Michael Kerrisk (man-pages) wrote:
>> On 06/01/2016 07:17 PM, Dave Hansen wrote:
>>> On 06/01/2016 05:11 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>>>
>>>>>>>> If I read this right, it doesn't actually remove any pkey restrictions
>>>>>>>> that may have been applied while the key was allocated. So there could
>>>>>>>> be pages with that key assigned that might do surprising things if the
>>>>>>>> key is reallocated for another use later, right? Is that how the API
>>>>>>>> is intended to work?
>>>>>>
>>>>>> Yeah, that's how it works.
>>>>>>
>>>>>> It's not ideal. It would be _best_ if, during mm_pkey_free(), we
>>>>>> ensured that no VMAs under that mm have that vma_pkey() set. But, that
>>>>>> search would be potentially expensive (a walk over all VMAs), or would
>>>>>> force us to keep a data structure with a count of all the VMAs with a
>>>>>> given key.
>>>>>>
>>>>>> I should probably discuss this behavior in the manpages and address it
>>>> s/probably//
>>>>
>>>> And, did I miss it. Was there an updated man-pages patch in the latest
>>>> series? I did not notice it.
>>>
>>> There have been no changes to the patches that warranted updating the
>>> manpages until now. I'll send the update immediately.
>>
>> Do those updated pages include discussion of the point noted above?
>> I could not see it mentioned there.
>
> I added the following text to pkey_alloc.2. I somehow neglected to send
> it out in the v3 update of the manpages RFC:
>
>     An application should not call
>     .BR pkey_free ()
>     on any protection key which has been assigned to an address
>     range by
>     .BR pkey_mprotect ()
>     and which is still in use. The behavior in this case is
>     undefined and may result in an error.
>
> I'll add that in the version (v4) I send out shortly.
>
>> Just by the way, the above behavior seems to offer possibilities
>> for users to shoot themselves in the foot, in a way that has security
>> implications. (Or do I misunderstand?)
>
> Protection keys have the potential to add a layer of security and
> reliability to applications. But, they have not been primarily designed as
> a security feature. For instance, WRPKRU is a completely unprivileged
> instruction, so pkeys are useless in any case that an attacker controls
> the PKRU register or can execute arbitrary instructions.
>
> That said, this mechanism does, indeed, allow a user to shoot themselves
> in the foot and in a way that could have security implications.
>
> For instance, say the following happened:
> 1. A sensitive bit of data in memory was marked with a pkey
> 2. That pkey was set as PKEY_DISABLE_ACCESS
> 3. The application called pkey_free() on the pkey, without freeing
>    the sensitive data
> 4. Application calls pkey_alloc() and then clears PKEY_DISABLE_ACCESS
> 5. Application can now read the sensitive data
>
> The application has to have basically "leaked" a reference to the pkey.
> It forgot that it had sensitive data marked with that key.
>
> The kernel _could_ enforce that no in-use pkey may have pkey_free()
> called on it. But, doing that has tradeoffs which could make
> pkey_free() extremely slow:
>
>> It's not ideal. It would be _best_ if, during mm_pkey_free(), we
>> ensured that no VMAs under that mm have that vma_pkey() set. But, that
>> search would be potentially expensive (a walk over all VMAs), or would
>> force us to keep a data structure with a count of all the VMAs with a
>> given key.
>
> In addition, that checking _could_ be implemented in an application by
> inspecting /proc/$pid/smaps for "ProtectionKey: $foo" before calling
> pkey_free($foo).

So, I think all of the above needs to be made abundantly clear in
pkeys(7).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
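For anyone following along, the user-space check Dave mentions (looking for "ProtectionKey: $foo" in /proc/$pid/smaps before calling pkey_free($foo)) might look roughly like the sketch below. The function names are made up for illustration, and the check is inherently racy: another thread could assign the key to a mapping between the scan and the pkey_free() call.

```python
import re

# Matches a smaps line such as "ProtectionKey:         3".
_PKEY_LINE = re.compile(r"ProtectionKey:\s*(\d+)\s*$")


def pkey_in_use(smaps_text, pkey):
    """Return True if any mapping listed in the smaps text carries pkey."""
    for line in smaps_text.splitlines():
        match = _PKEY_LINE.match(line)
        if match and int(match.group(1)) == pkey:
            return True
    return False


def pkey_in_use_for_pid(pid, pkey):
    """Hypothetical convenience wrapper that reads /proc/<pid>/smaps."""
    with open(f"/proc/{pid}/smaps") as smaps:
        return pkey_in_use(smaps.read(), pkey)
```

An application would call pkey_in_use_for_pid(os.getpid(), key) and skip pkey_free() while it returns True. On kernels or hardware without protection keys the "ProtectionKey:" line is simply absent, and the function returns False.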
Re: [PATCH 5/8] x86, pkeys: allocation/free syscalls
On 06/01/2016 07:17 PM, Dave Hansen wrote:
> On 06/01/2016 05:11 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>
>>>>>> If I read this right, it doesn't actually remove any pkey restrictions
>>>>>> that may have been applied while the key was allocated. So there could
>>>>>> be pages with that key assigned that might do surprising things if the
>>>>>> key is reallocated for another use later, right? Is that how the API
>>>>>> is intended to work?
>>>>
>>>> Yeah, that's how it works.
>>>>
>>>> It's not ideal. It would be _best_ if, during mm_pkey_free(), we
>>>> ensured that no VMAs under that mm have that vma_pkey() set. But, that
>>>> search would be potentially expensive (a walk over all VMAs), or would
>>>> force us to keep a data structure with a count of all the VMAs with a
>>>> given key.
>>>>
>>>> I should probably discuss this behavior in the manpages and address it
>> s/probably//
>>
>> And, did I miss it. Was there an updated man-pages patch in the latest
>> series? I did not notice it.
>
> There have been no changes to the patches that warranted updating the
> manpages until now. I'll send the update immediately.

Do those updated pages include discussion of the point noted above?
I could not see it mentioned there.

Just by the way, the above behavior seems to offer possibilities
for users to shoot themselves in the foot, in a way that has security
implications. (Or do I misunderstand?)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH 5/8] x86, pkeys: allocation/free syscalls
Hi Dave,

On 1 June 2016 at 14:32, Dave Hansen <d...@sr71.net> wrote:
> On 06/01/2016 11:37 AM, Jonathan Corbet wrote:
>>> +static inline
>>> +int mm_pkey_free(struct mm_struct *mm, int pkey)
>>> +{
>>> +        /*
>>> +         * pkey 0 is special, always allocated and can never
>>> +         * be freed.
>>> +         */
>>> +        if (!pkey || !validate_pkey(pkey))
>>> +                return -EINVAL;
>>> +        if (!mm_pkey_is_allocated(mm, pkey))
>>> +                return -EINVAL;
>>> +
>>> +        mm_set_pkey_free(mm, pkey);
>>> +
>>> +        return 0;
>>> +}
>>
>> If I read this right, it doesn't actually remove any pkey restrictions
>> that may have been applied while the key was allocated. So there could be
>> pages with that key assigned that might do surprising things if the key is
>> reallocated for another use later, right? Is that how the API is intended
>> to work?
>
> Yeah, that's how it works.
>
> It's not ideal. It would be _best_ if, during mm_pkey_free(), we
> ensured that no VMAs under that mm have that vma_pkey() set. But, that
> search would be potentially expensive (a walk over all VMAs), or would
> force us to keep a data structure with a count of all the VMAs with a
> given key.
>
> I should probably discuss this behavior in the manpages and address it

s/probably//

And, did I miss it. Was there an updated man-pages patch in the latest
series? I did not notice it.

> more directly in the changelog for this patch.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
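To make Jonathan's observation concrete, here is a toy Python model of the allocation map. It is a sketch under stated assumptions, not the kernel implementation: the MM class and its vma_pkey dictionary are hypothetical, NR_PKEYS = 16 is assumed to be the x86 key count, and "lowest free key first" is an assumed allocation policy. The pkey_free() method mirrors the checks in the quoted mm_pkey_free() and, like it, leaves existing mappings untouched.

```python
import errno

NR_PKEYS = 16  # assumption: number of protection keys on x86


class MM:
    """Toy stand-in for the per-process pkey state (hypothetical)."""

    def __init__(self):
        self.allocated = {0}   # pkey 0 is special and always allocated
        self.vma_pkey = {}     # mapping start address -> pkey tag

    def pkey_alloc(self):
        # Assumed policy: hand out the lowest key that is not allocated.
        for key in range(1, NR_PKEYS):
            if key not in self.allocated:
                self.allocated.add(key)
                return key
        return -errno.ENOSPC

    def pkey_mprotect(self, addr, pkey):
        if pkey not in self.allocated:
            return -errno.EINVAL
        self.vma_pkey[addr] = pkey
        return 0

    def pkey_free(self, pkey):
        # Same checks as the quoted mm_pkey_free().
        if pkey == 0 or not 0 <= pkey < NR_PKEYS:
            return -errno.EINVAL
        if pkey not in self.allocated:
            return -errno.EINVAL
        # Only the allocation bit is cleared; mappings keep their tag.
        self.allocated.discard(pkey)
        return 0
```

In this model, allocating a key, tagging a mapping with it, freeing the key, and allocating again returns the same key number while the old mapping still carries it, which is the surprising reuse discussed above.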
Re: Mount namespace "dominant peer group"?
On 05/23/2016 02:55 AM, Miklos Szeredi wrote:
> C is slave of B is slave of A. If a process can see (i.e. has under
> its root) A and C but not B then for C it will show
> master:B,propagate_from:A. This piece of information is shown because
> it can't see the immediate master (B) and so cannot determine the
> chain of propagation between the mounts it can see.

Thanks, Miklos!

> Concrete example:

Yep, that does it. Thanks for the walk through! One piece missing
below though, in case anyone else tries to walk through.

> # mount --bind / /mnt
> # mount --bind /proc /mnt/proc
> # mount --make-private /mnt
> # mount --make-shared /mnt
> # mkdir /tmp/etc
> # mount --bind /mnt/etc /tmp/etc
> # mount --make-slave /tmp/etc
> # mount --make-shared /tmp/etc

# mkdir /mnt/tmp/etc

> # mount --bind /tmp/etc /mnt/tmp/etc
> # mount --make-slave /mnt/tmp/etc
> # cat /proc/self/mountinfo | grep /tmp/etc
> 164 40 253:1 /etc /tmp/etc rw,relatime shared:100 master:97 - ...
> # chroot /mnt
> # cat /proc/self/mountinfo
> 129 62 253:1 / / rw,relatime shared:97 - ...
> 168 129 253:1 /etc /tmp/etc rw,relatime master:100 propagate_from:97 - ...

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
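The rule Miklos describes can be sketched as a small Python model, shown here only to illustrate the explanation and not as the kernel's algorithm. The function name, the master_of dictionary (peer group ID to the ID of the group it is a slave of), and the visible set (peer groups that have a mount under the process's root) are all hypothetical.

```python
def optional_tags(shared_id, master_id, master_of, visible):
    """Build the tagged mountinfo fields for one mount (toy model).

    shared_id  peer group the mount belongs to, or None
    master_id  peer group the mount is a slave of, or None
    master_of  dict: peer group ID -> master peer group ID
    visible    set of peer group IDs with a mount under the root
    """
    tags = []
    if shared_id:
        tags.append(f"shared:{shared_id}")
    if master_id:
        tags.append(f"master:{master_id}")
        # Walk up the chain of masters to the closest visible peer group.
        dom = master_id
        while dom is not None and dom not in visible:
            dom = master_of.get(dom)
        # Only reported when it differs from the immediate master.
        if dom and dom != master_id:
            tags.append(f"propagate_from:{dom}")
    return tags
```

Applied to the walk-through above: peer group 100 is a slave of 97. Before the chroot both groups are visible, so mount 164 shows "shared:100 master:97". After "chroot /mnt" only group 97 has a mount under the root, so the slave of group 100 shows "master:100 propagate_from:97".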
Re: Mount namespace "dominant peer group"?
Hello Ram,

On 05/20/2016 06:15 PM, Ram Pai wrote:
> On Fri, May 20, 2016 at 04:24:18PM -0500, Michael Kerrisk (man-pages) wrote:
>> Hello Miklos,
>>
>> I'm working on some better documentation of mount namespaces,
>> and there's a detail that puzzles me, and I hope you might be
>> able to help, since you added the detail...
>>
>> In Documentation/filesystems/proc.txt there is this text in the
>> description of /proc/PID/mountinfo:
>>
>> [[
>> Parsers should ignore all unrecognised optional fields. Currently the
>> possible optional fields are:
>>
>> shared:X          mount is shared in peer group X
>> master:X          mount is slave to peer group X
>> propagate_from:X  mount is slave and receives propagation from peer group X (*)
>> unbindable        mount is unbindable
>>
>> (*) X is the closest dominant peer group under the process's root. If
>> X is the immediate master of the mount, or if there's no dominant peer
>> group under the same root, then only the "master:X" field is present
>> and not the "propagate_from:X" field.
>> ]]
>>
>> What is a dominant peer group, as distinct from the immediate master?
>>
>> I can see in fs/proc_namespaces.c that there is this distinction made:
>>
>> [[
>>         /* Tagged fields ("foo:X" or "bar") */
>>         if (IS_MNT_SHARED(r))
>>                 seq_printf(m, " shared:%i", r->mnt_group_id);
>>         if (IS_MNT_SLAVE(r)) {
>>                 int master = r->mnt_master->mnt_group_id;
>>                 int dom = get_dominating_id(r, &p->root);
>>                 seq_printf(m, " master:%i", master);
>>                 if (dom && dom != master)
>>                         seq_printf(m, " propagate_from:%i", dom);
>>         }
>> ]]
>>
>> But I can't relate that to some user-space semantics. I suppose another
>> way of asking my question is: how could I create a slave that is
>> propagating from a peer group other than its immediate master?
>
> It can happen if you have unmounted or privatised all your master mounts from
> the peer group.
>
> Eg:
>
> mount /dev/xyz /1         # creates a new mount
> mount --make-private /1   # just make sure that it does not receive or send
>                           # any propagation
> mount --make-shared /1    # now make it shared.
> mount --bind /1 /2        # create a peer; /1 and /2 are peers
> create a new fs-namespace. this new fs-namespace which will have /1' and /2'.
> /1 /2 /1' /2' are now all part of the same peergroup.
> mount --make-slave /2     # this will make /2 a slave of the peer group that
>                           # contains /1 /1' and /2'
> umount /1                 # we now have /2 which receives propagation from a peer group
>                           # which does not have a representative in its fs-namespace.

Thanks for the note. However, doing the above, I still do not see
any mount being marked with 'propagate_from'. Perhaps I misunderstood
your instructions above. Here's what I did:

    sh1# mount --make-private /      # Make sure everything is private...
    sh1# mount /dev/sdb6 /1
    sh1# mount --make-private /1
    sh1# mount --make-shared /1
    sh1# mount --bind /1 /2
    sh1# cat /proc/self/mountinfo | grep '/[12] ' | sed 's/ - .*//'
    81 61 8:22 / /1 rw,relatime shared:1
    82 61 8:22 / /2 rw,relatime shared:1

Then, at a second terminal, create a new mount NS:

    sh2# unshare -m --propagation unchanged sh
    sh2# cat /proc/self/mountinfo | grep '/[12] ' | sed 's/ - .*//'
    169 132 8:22 / /1 rw,relatime shared:1
    170 132 8:22 / /2 rw,relatime shared:1

Returning to the first terminal:

    sh1# mount --make-slave /2
    sh1# umount /1
    sh1# cat /proc/self/mountinfo | grep '/[12] ' | sed 's/ - .*//'
    82 61 8:22 / /2 rw,relatime master:1

That is, we see /2 in the initial mount namespace is a slave but there
is no 'propagate_from' tag. Did I miss something?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
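For experiments like the one above, a small helper that pulls the tagged optional fields out of a mountinfo line can save some squinting. This is a minimal sketch, assuming the usual layout in which six fixed fields are followed by zero or more optional fields and a lone "-" separator; the function name is made up. As the proc.txt excerpt asks, unrecognised optional fields are ignored.

```python
KNOWN_ID_TAGS = ("shared", "master", "propagate_from")


def parse_optional_fields(line):
    """Return the recognised optional fields of one mountinfo line."""
    tokens = line.split()
    # Optional fields sit between the six fixed fields and the "-" separator.
    separator = tokens.index("-", 6)
    fields = {}
    for token in tokens[6:separator]:
        tag, _, value = token.partition(":")
        if tag in KNOWN_ID_TAGS and value.isdigit():
            fields[tag] = int(value)
        elif token == "unbindable":
            fields["unbindable"] = True
        # Anything else is an unrecognised optional field and is skipped.
    return fields
```

Feeding it each line of /proc/self/mountinfo and printing the mounts whose result contains "propagate_from" shows directly whether the tag has appeared. A mount with no propagation tags at all (a private mount) yields an empty dictionary.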
Mount namespace "dominant peer group"?
Hello Miklos,

I'm working on some better documentation of mount namespaces,
and there's a detail that puzzles me, and I hope you might be
able to help, since you added the detail...

In Documentation/filesystems/proc.txt there is this text in the
description of /proc/PID/mountinfo:

[[
Parsers should ignore all unrecognised optional fields. Currently the
possible optional fields are:

shared:X          mount is shared in peer group X
master:X          mount is slave to peer group X
propagate_from:X  mount is slave and receives propagation from peer group X (*)
unbindable        mount is unbindable

(*) X is the closest dominant peer group under the process's root. If
X is the immediate master of the mount, or if there's no dominant peer
group under the same root, then only the "master:X" field is present
and not the "propagate_from:X" field.
]]

What is a dominant peer group, as distinct from the immediate master?

I can see in fs/proc_namespace.c that there is this distinction made:

[[
        /* Tagged fields ("foo:X" or "bar") */
        if (IS_MNT_SHARED(r))
                seq_printf(m, " shared:%i", r->mnt_group_id);
        if (IS_MNT_SLAVE(r)) {
                int master = r->mnt_master->mnt_group_id;
                int dom = get_dominating_id(r, &p->root);
                seq_printf(m, " master:%i", master);
                if (dom && dom != master)
                        seq_printf(m, " propagate_from:%i", dom);
        }
]]

But I can't relate that to some user-space semantics. I suppose another
way of asking my question is: how could I create a slave that is
propagating from a peer group other than its immediate master?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
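To make the condition under discussion concrete, the decision logic of the kernel fragment quoted above can be restated as a shell function. This is only an illustrative sketch: `mount_tags` is a hypothetical name, and it models "not shared" and "not a slave" as a peer group ID of 0, whereas the kernel tests mount flags (IS_MNT_SHARED, IS_MNT_SLAVE).

```shell
# Hypothetical restatement of the quoted kernel logic (illustration only).
# Usage: mount_tags SHARED_ID MASTER_ID DOM_ID
# Assumption of this sketch: an ID of 0 means "not shared" / "not a slave" /
# "no dominating peer group".
mount_tags() {
    shared=$1
    master=$2
    dom=$3
    tags=""
    if [ "$shared" -ne 0 ]; then
        tags="$tags shared:$shared"
    fi
    if [ "$master" -ne 0 ]; then
        tags="$tags master:$master"
        # propagate_from is shown only when a dominating peer group exists
        # and differs from the immediate master.
        if [ "$dom" -ne 0 ] && [ "$dom" -ne "$master" ]; then
            tags="$tags propagate_from:$dom"
        fi
    fi
    printf '%s\n' "${tags# }"
}
```

Seen this way, the question in the message is which sequence of mount operations leaves a slave mount with a dominating ID that is non-zero and different from the ID of its immediate master.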
man-pages-4.06 is released
Gidday,

The Linux man-pages maintainer proudly announces:

    man-pages-4.06 - man pages for Linux

This release includes input and contributions from around 20 people.
Around 40 pages saw changes, ranging from typo fixes through to page
rewrites and newly created pages.

Tarball download:
    http://www.kernel.org/doc/man-pages/download.html
Git repository:
    https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
    http://man7.org/linux/man-pages/changelog.html#release_4.06

A short summary of the release is blogged at:
http://linux-man-pages.blogspot.com/2016/05/man-pages-406-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers on LKML is shown below.

Cheers,

Michael

Changes in man-pages-4.06

New and rewritten pages
-----------------------

cgroups.7
    Serge Hallyn, Michael Kerrisk
        New page documenting cgroups

cgroup_namespaces.7
    Michael Kerrisk  [Serge Hallyn]
        New page describing cgroup namespaces

Newly documented interfaces in existing pages
---------------------------------------------

clone.2
    Michael Kerrisk
        Document CLONE_NEWCGROUP

readv.2
    Christoph Hellwig
        Document preadv2() and pwritev2()

setns.2
    Michael Kerrisk
        Document CLONE_NEWCGROUP

unshare.2
    Michael Kerrisk
        Document CLONE_NEWCGROUP

Changes to individual pages
---------------------------

clock_getres.2
    Michael Kerrisk  [Rasmus Villemoes]
        Note that coarse clocks need architecture and VDSO support

execve.2
    Michael Kerrisk  [Valery Reznic]
        Since Linux 2.6.28, recursive script interpretation is supported

fcntl.2
    Michael Kerrisk
        Note that mandatory locking is now governed by a configuration option

mount.2
    Michael Kerrisk
        MS_MANDLOCK requires CAP_SYS_ADMIN (since Linux 4.5)

quotactl.2
    Michael Kerrisk
        Document Q_GETNEXTQUOTA and Q_XGETNEXTQUOTA

sigaction.2
    Michael Kerrisk
        Document SEGV_BNDERR
    Michael Kerrisk
        Document SEGV_PKUERR

core.5
    Michael Kerrisk
        Document /proc/sys/kernel/core_pipe_limit

namespaces.7
    Michael Kerrisk
        SEE ALSO: add cgroups(7), cgroup_namespaces(7)

vdso.7
    Zubair Lutfullah Kakakhel  [Mike Frysinger]
        Update for MIPS
            Document the symbols exported by the MIPS VDSO.
            VDSO support was added from kernel 4.4 onwards.

            See 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/arch/mips/vdso
    Michael Kerrisk  [Rasmus Villemoes]
        The __kernel_clock_* interfaces don't support *_COARSE clocks on PowerPC

ld.so.8
    Michael Kerrisk  [Alon Bar-Lev]
        Document use of $ORIGIN, $LIB, and $PLATFORM in environment variables
            These strings are meaningful in LD_LIBRARY_PATH and LD_PRELOAD.

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH] mountinfo: implement show_path for kernfs and cgroup
Hi Serge,

On 6 May 2016 at 19:33, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
>> Hi Serge,
>>
>> I'll add my own notes below, as much as anything in order to convince
>> myself that I understand what's going on.
>>
>> On 05/05/2016 05:20 PM, Serge E. Hallyn wrote:
>> > Short explanation:
>> >
>> > When showing a cgroupfs entry in mountinfo, show the path of the mount
>> > root dentry relative to the reader's cgroup namespace root.
>>
>> As part of the commit message, I think it would be useful to add a
>> sentence here explaining why this is needed / which applications need it.
>>
>> > Long version:
>> >
>> > When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
>> > namespace, and then mounts a new instance of the freezer cgroup, the new
>> > mount will be rooted at /a/b. The root dentry field of the mountinfo
>> > entry will show '/a/b'.
>>
>> So, the point is that if we create a new cgroup namespace,
>> then we want both /proc/self/cgroup and /proc/self/mountinfo
>> to show cgroup paths that are correctly virtualized with
>> respect to the cgroup mount point. Previous to this patch,
>> /proc/self/cgroup shows the right info, but
>> /proc/self/mountinfo does not. (Walk through in a moment.)
>>
>> Is the above a correct summary?

Feel free to add that piece to the commit message :-).

[...]

>> So, I applied your patch against a current (i.e., 4.6-rc6) kernel.
>> Same steps as before, and here's what I see:
>>
>> # mkdir -p /sys/fs/cgroup/freezer/a/b
>> # echo $$ > /sys/fs/cgroup/freezer/a/b/cgroup.procs
>> # ./cgroup_info.sh
>> /proc/self/cgroup: 10:freezer:/a/b
>> mountinfo: / /sys/fs/cgroup/freezer
>> # ~mtk/tlpi/code/ns/unshare -Cm bash
>> # ./cgroup_info.sh
>> /proc/self/cgroup: 10:freezer:/
>> mountinfo: /../.. /sys/fs/cgroup/freezer
>> # mount --make-rslave /
>> # mkdir -p /mnt/freezer
>> # umount /sys/fs/cgroup/freezer
>> # mount -t cgroup -o freezer freezer /mnt/freezer/
>> # ./cgroup_info.sh
>> /proc/self/cgroup: 10:freezer:/
>> mountinfo: / /mnt/freezer
>>
>> Now the root directory path shown by mountinfo is correct,
>> and when we look inside the mount point, we see that things
>> look "right" (i.e., a cgroup root directory with no
>> subdirectories, and the PID of the shell run by unshare is
>> in the cgroup.procs file of this cgroup):
>>
>> # ls /mnt/freezer/
>> cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
>> cgroup.procs           freezer.self_freezing    notify_on_release
>> # echo $$
>> 3164
>> # cat /mnt/freezer/cgroup.procs
>> 2653     # First shell that was placed in this cgroup
>> 3164     # Shell started by 'unshare'
>> 14197    # cat(1)
>>
>> All makes sense to me.
>
> Right. So in particular, docker wants to do something like:
>
> bindpath=`grep freezer /proc/self/mountinfo | tail -n 1 | awk '{ print $4 }'`
> mountpoint=`grep freezer /proc/self/mountinfo | tail -n 1 | awk '{ print $5 }'`
> mycg=`awk -F: '/freezer/ { print $3 }' /proc/self/cgroup`
> cat ${mountpoint}/${bindpath}/${mycg}/cgroup.procs
>
> and see its own task.

I think that'd be a great piece to include in the commit message,
near the top, as rationale for the patch.

>> Tested-by: Michael Kerrisk <mtk.manpa...@gmail.com>
>> Acked-by: Michael Kerrisk <mtk.manpa...@gmail.com>
>>
>> (I did no review of the patch itself though.)
>
> Thanks, Michael.

You're welcome.

> I'll resend with corrections and a test script of
> some sort.

I think including some version of the two walk-throughs (without +
with patch) would also make for a great commit message :-).

Cheers,

Michael

[...]
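The lookup that Serge describes could be wrapped in a function that takes the two proc files as arguments, which makes it possible to try it against saved copies of /proc/self/mountinfo and /proc/self/cgroup. This is a sketch under assumptions, not code from the thread: the name `cgroup_procs_path` and its argument order are invented here, only the first matching line of the cgroup file is used, and repeated slashes in the joined path are squeezed for readability.

```shell
# Hypothetical wrapper around the lookup quoted above (illustration only).
# Usage: cgroup_procs_path CONTROLLER MOUNTINFO_FILE CGROUP_FILE
cgroup_procs_path() {
    ctrl=$1
    mountinfo=$2
    cgroup=$3
    # Last mountinfo entry mentioning the controller: field 4 is the root
    # of the mount within the hierarchy, field 5 is the mount point.
    bindpath=$(grep "$ctrl" "$mountinfo" | tail -n 1 | awk '{ print $4 }')
    mountpoint=$(grep "$ctrl" "$mountinfo" | tail -n 1 | awk '{ print $5 }')
    # Third colon-separated field of the first cgroup line whose
    # controller list contains the controller name.
    mycg=$(awk -F: -v c="$ctrl" 'index($2, c) { print $3; exit }' "$cgroup")
    # Join the three pieces and squeeze runs of "/" into one.
    printf '%s\n' "$mountpoint/$bindpath/$mycg/cgroup.procs" | tr -s /
}
```

With the patch applied, `cat "$(cgroup_procs_path freezer /proc/self/mountinfo /proc/self/cgroup)"` inside the new cgroup namespace would be expected to list the caller's own PID; without it, a root field such as `/../..` makes the joined path point at the wrong directory.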
Re: [PATCH] mountinfo: implement show_path for kernfs and cgroup
e information in /proc/PID/mountinfo. (The current patch fixes
exactly this problem.)

> With this patch, the dentry root field in mountinfo is shown relative
> to the reader's cgroup namespace. I.e.:
>
> unshare -Gm bash /tmp/do1
> > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> > 355 133 0:34 / /mnt rw,relatime - cgroup freezer rw,freezer
>
> This way the task can correlate the paths in /proc/pid/cgroup to
> /proc/self/mountinfo, and determine which cgroup directory (in any
> mount which the reader created) corresponds to the task.

So, I applied your patch against a current (i.e., 4.6-rc6) kernel.
Same steps as before, and here's what I see:

    # mkdir -p /sys/fs/cgroup/freezer/a/b
    # echo $$ > /sys/fs/cgroup/freezer/a/b/cgroup.procs
    # ./cgroup_info.sh
    /proc/self/cgroup: 10:freezer:/a/b
    mountinfo: / /sys/fs/cgroup/freezer
    # ~mtk/tlpi/code/ns/unshare -Cm bash
    # ./cgroup_info.sh
    /proc/self/cgroup: 10:freezer:/
    mountinfo: /../.. /sys/fs/cgroup/freezer
    # mount --make-rslave /
    # mkdir -p /mnt/freezer
    # umount /sys/fs/cgroup/freezer
    # mount -t cgroup -o freezer freezer /mnt/freezer/
    # ./cgroup_info.sh
    /proc/self/cgroup: 10:freezer:/
    mountinfo: / /mnt/freezer

Now the root directory path shown by mountinfo is correct,
and when we look inside the mount point, we see that things
look "right" (i.e., a cgroup root directory with no
subdirectories, and the PID of the shell run by unshare is
in the cgroup.procs file of this cgroup):

    # ls /mnt/freezer/
    cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
    cgroup.procs           freezer.self_freezing    notify_on_release
    # echo $$
    3164
    # cat /mnt/freezer/cgroup.procs
    2653     # First shell that placed in this cgroup
    3164     # Shell started by 'unshare'
    14197    # cat(1)

All makes sense to me.

Tested-by: Michael Kerrisk <mtk.manpa...@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpa...@gmail.com>

(I did no review of the patch itself though.)

Cheers,

Michael

> Signed-off-by: Serge Hallyn <serge.hal...@ubuntu.com>
> ---
>  fs/kernfs/mount.c      | 14 +++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 63 ++
>  3 files changed, 79 insertions(+)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f73541f..3b78724 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -15,6 +15,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include "kernfs-internal.h"
>
> @@ -40,6 +41,18 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
>  	return 0;
>  }
>
> +static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
> +{
> +	struct kernfs_node *node = dentry->d_fsdata;
> +	struct kernfs_root *root = kernfs_root(node);
> +	struct kernfs_syscall_ops *scops = root->syscall_ops;
> +
> +	if (scops && scops->show_path)
> +		return scops->show_path(sf, node, root);
> +
> +	return seq_dentry(sf, dentry, " \t\n\\");
> +}
> +
>  const struct super_operations kernfs_sops = {
>  	.statfs = simple_statfs,
>  	.drop_inode = generic_delete_inode,
> @@ -47,6 +60,7 @@ const struct super_operations kernfs_sops = {
>
>  	.remount_fs = kernfs_sop_remount_fs,
>  	.show_options = kernfs_sop_show_options,
> +	.show_path = kernfs_sop_show_path,
>  };
>
>  /**
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index c06c442..30f089e 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -152,6 +152,8 @@ struct kernfs_syscall_ops {
>  	int (*rmdir)(struct kernfs_node *kn);
>  	int (*rename)(struct kernfs_node *kn, struct kernfs_node *new_parent,
>  		      const char *new_name);
> +	int (*show_path)(struct seq_file *sf, struct kernfs_node *kn,
> +			 struct kernfs_root *root);
>  };
>
>  struct kernfs_root {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 909a7d3..afea39e 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1215,6 +1215,41 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>  	cgroup_free_root(root);
>  }
>
> +/*
> + * look up cgroup associated with current task's cgroup namespace on the
> + * specified hierarchy
> + */
> +static struct cgroup *
> +current_cgns_cgroup_from_root(struct cgroup_root *root)
> +{
> +	struct cgroup *res = NULL
Re: [PATCH 1/1] simplified security.nscapability xattr
On 05/02/2016 05:54 AM, Serge E. Hallyn wrote:
> On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
>> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
>>> Quoting Kees Cook (keesc...@chromium.org):
>>>> On Fri, Apr 22, 2016 at 10:26 AM, <serge.hal...@ubuntu.com> wrote:
>>>>> From: Serge Hallyn <serge.hal...@ubuntu.com>
> ...
>>>> This looks like userspace must knowingly be aware that it is in a
>>>> namespace and to DTRT instead of it being translated by the kernel
>>>> when setxattr is called under !init_user_ns?
>>>
>>> Yes - my libcap2 patch checks /proc/self/uid_map to decide that. If that
>>> shows you are in init_user_ns then it uses security.capability, otherwise
>>> it uses security.nscapability.
>>>
>>> I've occasionally considered having the xattr code do the quiet
>>> substitution if need be.
>>>
>>> In fact, much of this structure comes from when I was still trying to
>>> do multiple values per xattr. Given what we're doing here, we could
>>> keep the xattr contents exactly the same, just changing the name.
>>> So userspace could just get and set security.capability; if you are
>>> in a non-init user_ns, if security.capability is set then you cannot
>>> set it; if security.capability is not set, then the kernel writes
>>> security.nscapability instead and returns success.
>>>
>>> I don't like magic, but this might be just straightforward enough
>>> to not be offensive. Thoughts?
>>
>> Yeah, I think it might be better to have the magic in this case, since
>> it seems weird to just reject setxattr if a tool didn't realize it was
>> in a namespace. I'm not sure -- it is also nice to have an explicit
>> API here.
>>
>> I would defer to Eric or Michael on that. I keep going back and forth,
>> though I suspect it's probably best to do what you already have
>> (explicit API).
>
> Michael, Eric, what do you think? The choice we're making here is
> whether we should
>
> 1. Keep a nice simple separate pair of xattrs, the pre-existing
>    security.capability which can only be written from init_user_ns,
>    and the new (in this patch) security.nscapability which you can
>    write to any file where you are privileged wrt the file.
>
> 2. Make security.capability somewhat 'magic' - if someone in a
>    non-initial user ns tries to write it and has privilege wrt the
>    file, then the kernel silently writes security.nscapability instead.
>
> The biggest drawback of (1) would be any tar-like program trying
> to restore a file which had security.capability, needing to know
> to detect its userns and write the security.nscapability instead.
> The drawback of (2) is ~\o/~ magic.

I have only (minor) thoughts from the interface perspective.
(1) sounds like a source of possibly unpleasant surprises.
(2) is a little surprising, but less so if it's well documented,
and it saves us the surprises of (1). So, (2) sounds better.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
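To illustrate the burden that option (1) would place on user space, here is a rough sketch of the kind of check a tar-like tool would need before restoring a file capability. Everything in it is an assumption made for illustration: the thread does not show how the libcap2 patch inspects /proc/self/uid_map, the function names are hypothetical, and the test used here (a single identity mapping covering the full 32-bit range) is a common heuristic that a privileged parent could in principle also set up for a child user namespace.

```shell
# Hypothetical check (not from the thread): succeed if the given uid_map
# consists of the single identity mapping "0 0 4294967295", which is what
# the initial user namespace shows.
# Usage: in_initial_userns [UID_MAP_FILE]
in_initial_userns() {
    map=${1:-/proc/self/uid_map}
    [ "$(awk '{ print $1, $2, $3 }' "$map")" = "0 0 4294967295" ]
}

# Print the xattr name that a tool would have to write under option (1),
# using the names proposed in this thread.
# Usage: capability_xattr_name [UID_MAP_FILE]
capability_xattr_name() {
    if in_initial_userns "$@"; then
        echo security.capability
    else
        echo security.nscapability
    fi
}
```

Under option (2), none of this would be needed in user space: the tool would always write security.capability and the kernel would do the substitution.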
Re: [PATCH] Implement leftpad syscall
On 04/01/2016 11:33 AM, Richard Weinberger wrote: > From: David Gstir <da...@sigma-star.at> > > Implement the leftpad() system call such that userspace, > especially node.js applications, can in the near future directly > use it and no longer depend on fragile npm packages. Words can't express the importance of adding this system call! Thanks so much for proposing and implementing it! Acked-by: Michael Kerrisk <mtk.manpa...@gmail.com> Cheers, Michael > Signed-off-by: David Gstir <da...@sigma-star.at> > Signed-off-by: Richard Weinberger <rich...@nod.at> > --- > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > include/linux/syscalls.h | 1 + > kernel/sys.c | 35 > ++ > kernel/sys_ni.c | 1 + > 4 files changed, 38 insertions(+) > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > b/arch/x86/entry/syscalls/syscall_64.tbl > index cac6d17..f287712 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -335,6 +335,7 @@ > 326 common copy_file_range sys_copy_file_range > 327 64 preadv2 sys_preadv2 > 328 64 pwritev2 sys_pwritev2 > +329 common leftpad sys_leftpad > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index d795472..a0850bb 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -898,4 +898,5 @@ asmlinkage long sys_copy_file_range(int fd_in, loff_t > __user *off_in, > > asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags); > > +asmlinkage long sys_leftpad(char *str, char pad, char *dst, size_t dst_len); > #endif > diff --git a/kernel/sys.c b/kernel/sys.c > index cf8ba54..e42d972 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2432,3 +2432,38 @@ COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo > __user *, info) > return 0; > } > #endif /* CONFIG_COMPAT */ > + > + > +SYSCALL_DEFINE4(leftpad, char *, src, char, pad, char *, dst, size_t, > dst_len) > +{ > + char *buf; > + 
long ret; > + size_t len = strlen_user(src); > + size_t pad_len = dst_len - len; > + > + if (dst_len <= len || dst_len > 4096) { > + return -EINVAL; > + } > + > + buf = kmalloc(dst_len, GFP_KERNEL); > + if (!buf) > + return -ENOMEM; > + > + memset(buf, pad, pad_len); > + ret = copy_from_user(buf + pad_len, src, len); > + if (ret) { > + ret = -EFAULT; > + goto out; > + } > + > + ret = copy_to_user(dst, buf, dst_len); > + if (ret) { > + ret = -EFAULT; > + goto out; > + } > + > + ret = pad_len; > +out: > + kfree(buf); > + return ret; > +} > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index 2c5e3a8..262608d 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -175,6 +175,7 @@ cond_syscall(sys_setfsgid); > cond_syscall(sys_capget); > cond_syscall(sys_capset); > cond_syscall(sys_copy_file_range); > +cond_syscall(sys_leftpad); > > /* arch-specific weak syscall entries */ > cond_syscall(sys_pciconfig_read); > -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/
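For readers who want to follow the arithmetic in the quoted patch without a kernel tree, the same padding logic can be sketched in userspace. This is only an illustration: the function name leftpad_sketch is invented here, strlen() + 1 stands in for strlen_user() on the assumption that the kernel helper counts the terminating NUL, and the pad length is computed after the bounds check rather than before it as in the patch.

```c
#include <errno.h>
#include <string.h>

/* Userspace sketch of the padding arithmetic in the quoted patch.
 * Assumption: like strlen_user(), len includes the terminating NUL.
 * Returns the number of pad bytes written, or -EINVAL if dst_len does
 * not leave room for at least one pad byte or exceeds 4096. */
long leftpad_sketch(const char *src, char pad, char *dst, size_t dst_len)
{
    size_t len = strlen(src) + 1;
    size_t pad_len;

    if (dst_len <= len || dst_len > 4096)
        return -EINVAL;

    pad_len = dst_len - len;          /* computed only after the check */
    memset(dst, pad, pad_len);        /* pad bytes go first */
    memcpy(dst + pad_len, src, len);  /* then the string and its NUL */
    return (long)pad_len;
}
```

With an 8-byte destination, padding "abc" with '*' fills the buffer with "****abc" plus the NUL and returns 4. A destination no larger than the string itself is rejected, mirroring the dst_len <= len test in the patch.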
man-pages-4.05 is released
Gidday, The Linux man-pages maintainer proudly announces: man-pages-4.05 - man pages for Linux This release includes input and contributions from nearly 70 people. Over 400 pages saw changes, ranging from typo fixes through to page rewrites and newly created pages. Tarball download: http://www.kernel.org/doc/man-pages/download.html Git repository: https://git.kernel.org/cgit/docs/man-pages/man-pages.git/ Online changelog: http://man7.org/linux/man-pages/changelog.html#release_4.05 A short summary of the release is blogged at: http://linux-man-pages.blogspot.com/2016/03/man-pages-405-is-released.html The current version of the pages is browsable at: http://man7.org/linux/man-pages/ A selection of changes in this release that may be of interest to readers on LKML is shown below. Cheers, Michael Changes in man-pages-4.05 New and rewritten pages --- copy_file_range.2 Anna Schumaker [Darrick J. Wong, Christoph Hellwig, Michael Kerrisk] New page documenting copy_file_range() personality.2 Michael Kerrisk This page has been greatly expanded, to add descriptions of personality domains. fmemopen.3 Michael Kerrisk [Adhemerval Zanella] Significant reworking of this page: * Rework discussion of the (obsolete) binary mode * Split open_memstream(3) description into a separate page. * Note various fmemopen() bugs that were fixed in glibc 2.22 * Greatly expand description of 'mode' argument * Rework description of 'buf' and 'len' arguments * Expand discussion of "current position" for fmemopen() stream ntp_gettime.3 Michael Kerrisk New page describing ntp_gettime(3) and ntp_gettimex(3) open_memstream.3 Michael Kerrisk New page created by split of fmemopen(3). At the same time, add and rework a few details in the text. posix_spawn.3 Bill O. 
Gallmeister, Michael Kerrisk New man page documenting posix_spawn(3) and posix_spawnp(3) readdir.3 Michael Kerrisk [Florian Weimer] Split readdir_r() content into separate page Michael Kerrisk Near complete restructuring of the page and add some further details Michael Kerrisk [Florian Weimer, Rich Felker, Paul Eggert] Add a lot more detail on portable use of the 'd_name' field readdir_r.3 Michael Kerrisk [Florian Weimer] New page created after split of readdir(3). Michael Kerrisk [Florian Weimer] Explain why readdir_r() is deprecated and readdir() is preferred lirc.4 Alec Leamas New page documenting lirc device driver Newly documented interfaces in existing pages - epoll_ctl.2 Michael Kerrisk [Jason Baron] Document EPOLLEXCLUSIVE madvise.2 Minchan Kim [Michael Kerrisk] Document MADV_FREE Document the MADV_FREE flag added to madvise() in Linux 4.5. proc.5 Michael Kerrisk Document CmaTotal and CmaFree fields of /proc/meminfo Michael Kerrisk Document additional /proc/meminfo fields Document DirectMap4k, DirectMap4M, DirectMap2M, DirectMap1G Michael Kerrisk Document MemAvailable /proc/meminfo field Michael Kerrisk Document inotify /proc/PID/fdinfo entries Michael Kerrisk Document fanotify /proc/PID/fdinfo entries Michael Kerrisk Add some kernel version numbers for /proc/PID/fdinfo entries Michael Kerrisk [Patrick Donnelly] /proc/PID/fdinfo displays the setting of the close-on-exec flag Note also the pre-3.1 bug in the display of this info. 
socket.7 Craig Gallek [Michael Kerrisk, Vincent Bernat] Document some BPF-related socket options Document the behavior and the first kernel version for each of the following socket options: SO_ATTACH_FILTER SO_ATTACH_BPF SO_ATTACH_REUSEPORT_CBPF SO_ATTACH_REUSEPORT_EBPF SO_DETACH_FILTER SO_DETACH_BPF SO_LOCK_FILTER Global changes -- Many, many pages Michael Kerrisk Update, simplify and correct feature test macro requirements Changes to individual pages --- adjtimex.2 Michael Kerrisk [John Stultz] Various improvements after feedback from John Stultz syscall.2 Mike Frysinger Add more architectures and improve error documentation Move the error register documentation into the main table rather than listing them in sentences after the fact. Add sparc error return details. Add details for alpha/arc/m68k/microblaze/nios2/powerpc/superh/ tile/xtensa. feature_test_macros.7 Michael Kerrisk Add a summary of some FTM key points Michael Kerrisk
Re: [PATCH] epoll: add exclusive wakeups flag
Hi Jason, On 03/15/2016 11:35 AM, Jason Baron wrote: > Hi Michael, > > On 03/14/2016 05:03 PM, Michael Kerrisk (man-pages) wrote: >> Hi Jason, >> >> On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote: >>> Hi Jason, >>> >>> On 03/15/2016 08:32 AM, Jason Baron wrote: >>>> >>>> >>>> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote: >>>>> [Restoring CC, which I see I accidentally dropped, one iteration back.] >> >> [...] >> >>>>> Returning to the second sentence in this description: >>>>> >>>>> When a wakeup event occurs and multiple epoll file descrip‐ >>>>> tors are attached to the same target file using EPOLLEXCLU‐ >>>>> SIVE, one or more of the epoll file descriptors will >>>>> receive an event with epoll_wait(2). >>>>> >>>>> There is a point that is unclear to me: what does "target file" refer to? >>>>> Is it an open file description (aka open file table entry) or an inode? >>>>> I suspect the former, but it was not clear in your original text. >>>>> >>>> >>>> So from epoll's perspective, the wakeups are associated with a 'wait >>>> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via >>>> file->poll()) results in adding to the same 'wait queue' then we will >>>> get 'exclusive' wakeup behavior. >>>> >>>> So in general, I think the answer here is that it's associated with the >>>> inode (I couldn't say with 100% certainty without really looking at all >>>> file->poll() implementations). Certainly, with the 'FIFO' example below, >>>> the two scenarios will have the same behavior with respect to >>>> EPOLLEXCLUSIVE. >> >> So, I was actually a little surprised by this, and went away and tested >> this point. It appears to me that the two scenarios described below >> do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See below. >> >>> So, in both scenarios, *one or more* processes will get a wakeup? >>> (I'll try to add something to the text to clarify the detail we're >>> discussing.) 
>>> >>>> Also, the 'non-exclusive' mode would be subject to the same question of >>>> which wait queue the epfd is associated with... >>> >>> I'm not sure of the point you are trying to make here? >>> >>> Cheers, >>> >>> Michael >>> >>> >>>>> To make this point even clearer, here are two scenarios I'm thinking of. >>>>> In each case, we're talking of monitoring the read end of a FIFO. >>>>> >>>>> === >>>>> >>>>> Scenario 1: >>>>> >>>>> We have three processes each of which >>>>> 1. Creates an epoll instance >>>>> 2. Opens the read end of the FIFO >>>>> 3. Adds the read end of the FIFO to the epoll instance, specifying >>>>> EPOLLEXCLUSIVE >>>>> >>>>> When input becomes available on the FIFO, how many processes >>>>> get a wakeup? >> >> When I test this scenario, all three processes get a wakeup. >> >>>>> === >>>>> >>>>> Scenario 2 >>>>> >>>>> A parent process opens the read end of a FIFO and then calls >>>>> fork() three times to create three children. Each child then: >>>>> >>>>> 1. Creates an epoll instance >>>>> 2. Adds the read end of the FIFO to the epoll instance, specifying >>>>> EPOLLEXCLUSIVE >>>>> >>>>> When input becomes available on the FIFO, how many processes >>>>> get a wakeup? >> >> When I test this scenario, one process gets a wakeup. >> >> In other words, "target file" appears to mean open file description >> (aka open file table entry), not inode. >> >> This is actually what I suspected might be the case, but now I am >> puzzled. Given what I've discovered and what you suggest are the >> semantics, is the implementation correct? (I suspect that it is, >> but it is at odds with your statement above.) My test programs are >> inline below. >> >> Cheers, >> >> Michael >> > > Thanks for the test cases. So in your first test case, you are exiting > immediatel
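As a compact illustration of the registration step that both scenarios share, the following sketch adds a file descriptor to an epoll instance with EPOLLIN | EPOLLEXCLUSIVE and then waits on it. It is not one of the test programs from this thread. The helper names add_exclusive_reader and wait_for_input are invented for the example, and the fallback definition of EPOLLEXCLUSIVE is only there for older userspace headers.

```c
#include <sys/epoll.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)   /* fallback for older headers */
#endif

/* Hypothetical helper: register fd for input events with an exclusive
 * wakeup. Returns 0 on success, -1 on error (errno set by epoll_ctl). */
int add_exclusive_reader(int epfd, int fd)
{
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLEXCLUSIVE,
        .data = { .fd = fd },
    };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: wait up to timeout_ms for one event. Returns the
 * number of ready descriptors (0 or 1), or -1 on error. */
int wait_for_input(int epfd, int timeout_ms)
{
    struct epoll_event rev;

    return epoll_wait(epfd, &rev, 1, timeout_ms);
}
```

In Scenario 1 each process would call open() on the FIFO itself before add_exclusive_reader(), so every process has its own open file description. In Scenario 2 the parent would open the FIFO once and each forked child would pass the inherited descriptor, so all children share one open file description. The sketch deliberately says nothing about how many waiters wake up in either case, since that is the question under discussion.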
Re: [PATCH] epoll: add exclusive wakeups flag
Hi Jason, On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote: > Hi Jason, > > On 03/15/2016 08:32 AM, Jason Baron wrote: >> >> >> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote: >>> [Restoring CC, which I see I accidentally dropped, one iteration back.] [...] >>> Returning to the second sentence in this description: >>> >>> When a wakeup event occurs and multiple epoll file descrip‐ >>> tors are attached to the same target file using EPOLLEXCLU‐ >>> SIVE, one or more of the epoll file descriptors will >>> receive an event with epoll_wait(2). >>> >>> There is a point that is unclear to me: what does "target file" refer to? >>> Is it an open file description (aka open file table entry) or an inode? >>> I suspect the former, but it was not clear in your original text. >>> >> >> So from epoll's perspective, the wakeups are associated with a 'wait >> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via >> file->poll()) results in adding to the same 'wait queue' then we will >> get 'exclusive' wakeup behavior. >> >> So in general, I think the answer here is that it's associated with the >> inode (I couldn't say with 100% certainty without really looking at all >> file->poll() implementations). Certainly, with the 'FIFO' example below, >> the two scenarios will have the same behavior with respect to >> EPOLLEXCLUSIVE. So, I was actually a little surprised by this, and went away and tested this point. It appears to me that the two scenarios described below do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See below. > So, in both scenarios, *one or more* processes will get a wakeup? > (I'll try to add something to the text to clarify the detail we're > discussing.) > >> Also, the 'non-exclusive' mode would be subject to the same question of >> which wait queue the epfd is associated with... > > I'm not sure of the point you are trying to make here? 
> > Cheers, > > Michael > > >>> To make this point even clearer, here are two scenarios I'm thinking of. >>> In each case, we're talking of monitoring the read end of a FIFO. >>> >>> === >>> >>> Scenario 1: >>> >>> We have three processes each of which >>> 1. Creates an epoll instance >>> 2. Opens the read end of the FIFO >>> 3. Adds the read end of the FIFO to the epoll instance, specifying >>> EPOLLEXCLUSIVE >>> >>> When input becomes available on the FIFO, how many processes >>> get a wakeup? When I test this scenario, all three processes get a wakeup. >>> === >>> >>> Scenario 2 >>> >>> A parent process opens the read end of a FIFO and then calls >>> fork() three times to create three children. Each child then: >>> >>> 1. Creates an epoll instance >>> 2. Adds the read end of the FIFO to the epoll instance, specifying >>> EPOLLEXCLUSIVE >>> >>> When input becomes available on the FIFO, how many processes >>> get a wakeup? When I test this scenario, one process gets a wakeup. In other words, "target file" appears to mean open file description (aka open file table entry), not inode. This is actually what I suspected might be the case, but now I am puzzled. Given what I've discovered and what you suggest are the semantics, is the implementation correct? (I suspect that it is, but it is at odds with your statement above.) My test programs are inline below. Cheers, Michael /* t_EPOLLEXCLUSIVE_multipen.c Licensed under GNU GPLv2 or later. 
*/ #include #include #include #include #include #include #include #include #define errExit(msg)do { perror(msg); exit(EXIT_FAILURE); \ } while (0) #define usageErr(msg, progName) \ do { fprintf(stderr, "Usage: "); \ fprintf(stderr, msg, progName); \ exit(EXIT_FAILURE); } while (0) #ifndef EPOLLEXCLUSIVE #define EPOLLEXCLUSIVE (1 << 28) #endif int main(int argc, char *argv[]) { int fd, epfd, nready; struct epoll_event ev, rev; if (argc != 2 || strcmp(argv[1], "--help") == 0) usageErr("%s n", argv[0]); epfd = epoll_create(2); if (epfd == -1) errExit("epoll_create"); fd = open(argv[1], O_RDONLY); if (fd == -1) errExit("open"); printf("Opened %s\n", argv[1]); ev.events = EPOLLIN | EPOLLEXCLUSIVE; if (epoll_ctl(