Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)


On 07/26/2016 04:54 AM, Andrew Vagin wrote:

On Mon, Jul 25, 2016 at 09:59:43AM -0500, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


[snip]


[snip]

So, from my point of view, the important piece that was missing from
your commit message was the note to use readlink("/proc/self/fd/%d")
on the returned FDs. I think that detail needs to be part of the
commit message (and also the man page text). I think it would even be
helpful to include the above program as part of the commit message:
it helps people more quickly grasp the API.


Please, please make the standard way to compare these things fstat.
That is much less magic than a symlink, and a little more future proof.
Possibly even kcmp.


I like the idea to use kcmp to compare namespaces. I am going to add this
functionality to kcmp and describe all these in the man page.


Hi Andrey,

Can you briefly sketch out the proposed API and how it would be used?
I'd find it useful to see that even before the implementation.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

2016-07-25 Thread Michael Kerrisk (man-pages)


Hi Eric,

On 07/25/2016 03:18 PM, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


Hi Andrey,

On 07/22/2016 08:25 PM, Andrey Vagin wrote:

On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

Hi Andrey,


On 07/21/2016 11:06 PM, Andrew Vagin wrote:


On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
wrote:


Hi Andrey,

On 07/14/2016 08:20 PM, Andrey Vagin wrote:







Could you add here an explanation of the API in detail: what do these FDs
refer to, and how do you use them to solve the use case? And could you add
that info to the commit messages, please?



Hi Michael,

A patch for man-pages is attached. It adds the following text to
namespaces(7).

Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors.  The correct syntax is:

  fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
  Returns a file descriptor that refers to an owning user namespace.

NS_GET_PARENT
  Returns a file descriptor that refers to a parent namespace.
  This ioctl(2) can be used for pid and user namespaces. For user
  namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning.


For each of the above, I think it is worth mentioning that the
close-on-exec flag is set for the returned file descriptor.


Hmm.  That is an odd default.


Why do you say that? It's pretty common as the default for various
APIs that create new FDs these days. (There's of course a strong argument
that the original UNIX default was a design blunder...)



In addition to generic ioctl(2) errors, the following specific ones can
occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is  outside  of the current namespace
  scope.


Perhaps add "and the caller does not have CAP_SYS_ADMIN in the initial
user namespace"?


Having looked at that bit of code I don't think capabilities really
have a role to play.


Yes, I caught up with that now. I'll wait to see how this plays out
in the next patch version.


ENOENT ns_fd refers to the init namespace.



Thanks for this. But still part of the question remains unanswered.
How do we (in user-space) use the file descriptors to answer any of
the questions that this patch series was designed to solve? (This
info should be in the commit message and the man-pages patch.)


I'm sorry, but I am not sure that I understand what you ask.

Here are the original questions:
Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespace types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations
on a resource governed by namespace Y?

Here is an example that shows how to get the inode number of the owning
namespace by using these ioctls.

$ ls -l /proc/13929/ns/pid
lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'

$ ./nsowner /proc/13929/ns/pid
user:[4026532227]

The owning user namespace for pid:[4026532228] is user:[4026532227].

The nsowner tool is compiled from this code:

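#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nsfs.h>		/* NS_GET_USERNS; assumes the series exports it here */

int main(int argc, char *argv[])
{
	char buf[128], path[] = "/proc/self/fd/0123456789";
	int ns, uns, ret;

	if (argc != 2)
		return 1;

	/* Open the namespace file, e.g. /proc/PID/ns/pid. */
	ns = open(argv[1], O_RDONLY);
	if (ns < 0)
		return 1;

	/* Get a descriptor for the user namespace that owns it. */
	uns = ioctl(ns, NS_GET_USERNS);
	if (uns < 0)
		return 1;

	/* The /proc/self/fd symlink names the namespace as "user:[inode]". */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
	ret = readlink(path, buf, sizeof(buf) - 1);
	if (ret < 0)
		return 1;
	buf[ret] = 0;

	printf("%s\n", buf);

	return 0;
}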


So, from my point of view, the important piece that was missing from
your commit message was the note to use readlink("/proc/self/fd/%d")
on the returned FDs. I think that detail needs to be part of the
commit message (and also the man page text). I think it would even be
helpful to include the above program as part of the commit message:
it helps people more quickly grasp the API.


Please, please make the standard way to compare these things fstat.
That is much less magic than a symlink, and a little more future proof.
Possibly even kcmp.


As in fstat() to get the st_ino field, right?

Cheers,

Michael


At some point we will care about migrating a sub-container and we
may have to have some minor changes.

Eric




--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

2016-07-25 Thread Michael Kerrisk (man-pages)


Hi Andrey,

On 07/22/2016 08:25 PM, Andrey Vagin wrote:

On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

Hi Andrey,


On 07/21/2016 11:06 PM, Andrew Vagin wrote:


On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
wrote:


Hi Andrey,

On 07/14/2016 08:20 PM, Andrey Vagin wrote:







Could you add here an explanation of the API in detail: what do these FDs
refer to, and how do you use them to solve the use case? And could you add
that info to the commit messages, please?



Hi Michael,

A patch for man-pages is attached. It adds the following text to
namespaces(7).

Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors.  The correct syntax is:

  fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
  Returns a file descriptor that refers to an owning user namespace.

NS_GET_PARENT
  Returns a file descriptor that refers to a parent namespace.
  This ioctl(2) can be used for pid and user namespaces. For user
  namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning.


For each of the above, I think it is worth mentioning that the
close-on-exec flag is set for the returned file descriptor.



In addition to generic ioctl(2) errors, the following specific ones can
occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is  outside  of the current namespace
  scope.


Perhaps add "and the caller does not have CAP_SYS_ADMIN in the initial
user namespace"?



ENOENT ns_fd refers to the init namespace.



Thanks for this. But still part of the question remains unanswered.
How do we (in user-space) use the file descriptors to answer any of
the questions that this patch series was designed to solve? (This
info should be in the commit message and the man-pages patch.)


I'm sorry, but I am not sure that I understand what you ask.

Here are the original questions:
Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespace types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations
on a resource governed by namespace Y?

Here is an example that shows how to get the inode number of the owning
namespace by using these ioctls.

$ ls -l /proc/13929/ns/pid
lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'

$ ./nsowner /proc/13929/ns/pid
user:[4026532227]

The owning user namespace for pid:[4026532228] is user:[4026532227].

The nsowner tool is compiled from this code:

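#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nsfs.h>		/* NS_GET_USERNS; assumes the series exports it here */

int main(int argc, char *argv[])
{
	char buf[128], path[] = "/proc/self/fd/0123456789";
	int ns, uns, ret;

	if (argc != 2)
		return 1;

	/* Open the namespace file, e.g. /proc/PID/ns/pid. */
	ns = open(argv[1], O_RDONLY);
	if (ns < 0)
		return 1;

	/* Get a descriptor for the user namespace that owns it. */
	uns = ioctl(ns, NS_GET_USERNS);
	if (uns < 0)
		return 1;

	/* The /proc/self/fd symlink names the namespace as "user:[inode]". */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
	ret = readlink(path, buf, sizeof(buf) - 1);
	if (ret < 0)
		return 1;
	buf[ret] = 0;

	printf("%s\n", buf);

	return 0;
}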


So, from my point of view, the important piece that was missing from
your commit message was the note to use readlink("/proc/self/fd/%d")
on the returned FDs. I think that detail needs to be part of the
commit message (and also the man page text). I think it would even be
helpful to include the above program as part of the commit message:
it helps people more quickly grasp the API.


Does this example answer the original question?


Yes.


If it doesn't, could
you elaborate on what you expect to see here?

And I wrote one more example, which shows all relationships between
namespaces. It enumerates all processes in a system, collects all
namespaces and determines the parent and owning namespaces for each of
them, then it constructs a namespace tree and shows it.

Here is the code: https://gist.github.com/avagin/db805f95e15ffb0af7e559dbb8de4418


That's great! Thanks!
 

Here is an example of output for my test system:
[root@fc24 nsfs]# ./nstree
user:[4026531837]
 \__  mnt:[4026532203]
 \__  ipc:[4026531839]
 \__  user:[4026532224]
 \__  user:[4026532226]
 \__  user:[4026532227]
 \__  pid:[4026532228]
 \__  pid:[4026532225]
 \__  pid:[4026532228]
 \__  user:[4026532221]
 \__  pid:[402653]
 \__  user:[4026532223]
 \__  mnt:[4026532211]
 \__  uts:[4026531838]
 \__  cgroup:[4026531835]
 \__  pid:[4026531836]
 \__  pid:[4026532225]
 \__  pid:[4026532228]
 \__  pid:[402653]
 \__  mnt:[4026531857]
 \__  mnt:[4026531840]
 \__  net:[4026531957]
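
The first two steps of such a tool (enumerate processes, collect the
distinct namespaces) can be sketched without the new ioctls at all. The
code below is not taken from the gist linked above; it is an independent,
simplified illustration, and collect_namespaces(), the fixed list of
namespace types, and the buffer sizes are all assumptions made for the
example. Determining parents and owners would then use NS_GET_PARENT and
NS_GET_USERNS on each collected entry.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NS_NAME_LEN 64

static const char *ns_types[] = {
	"user", "pid", "mnt", "net", "ipc", "uts", "cgroup"
};

/*
 * Scans proc_root/PID/ns/TYPE for every process and stores each distinct
 * link target ("type:[inode]") in names[]. Returns the number of entries
 * stored, or -1 if proc_root cannot be opened.
 */
int collect_namespaces(const char *proc_root, char names[][NS_NAME_LEN], int max)
{
	DIR *dir = opendir(proc_root);
	struct dirent *de;
	int count = 0;

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (!isdigit((unsigned char)de->d_name[0]))
			continue;	/* only numeric (per-process) entries */

		for (size_t t = 0; t < sizeof(ns_types) / sizeof(ns_types[0]); t++) {
			char path[512], link[NS_NAME_LEN];
			ssize_t len;
			int i;

			snprintf(path, sizeof(path), "%s/%s/ns/%s",
				 proc_root, de->d_name, ns_types[t]);
			len = readlink(path, link, sizeof(link) - 1);
			if (len < 0)
				continue;	/* process gone or no permission */
			link[len] = '\0';

			/* Linear search is enough for a sketch. */
			for (i = 0; i < count; i++)
				if (strcmp(names[i], link) == 0)
					break;
			if (i == count && count < max)
				strcpy(names[count++], link);
		}
	}
	closedir(dir);
	return count;
}
```

Calling collect_namespaces("/proc", names, max) and printing each entry
would give the flat list of namespaces from which a tree can be built.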


Cheers,

Michael


[1]

Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

2016-07-21 Thread Michael Kerrisk (man-pages)


Hi Andrey,

On 07/14/2016 08:20 PM, Andrey Vagin wrote:

Each namespace has an owning user namespace, and currently there is no way
to discover these relationships.

Pid and user namespaces are hierarchical. There is no way to discover
parent-child relationships either.

Why might we want to know the relationships between namespaces?

One use would be visualization, in order to understand the running system.
Another would be to answer the question: what capability does process X have to
perform operations on a resource governed by namespace Y?

One more use case (which is usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determine
relationships between namespaces.

Eric suggested adding two ioctls [2]:

Grumble, Grumble.  I think this may actually be a case for creating ioctls
for these two cases.  Now that random nsfs file descriptors are bind
mountable the original reason for using proc files is not as pressing.

One ioctl for the user namespace that owns a file descriptor.
One ioctl for the parent namespace of a namespace file descriptor.


Here is an implementation of these ioctls.


Could you add here an explanation of the API in detail: what do these FDs
refer to, and how do you use them to solve the use case? And could you add
that info to the commit messages, please?

Thanks,

Michael



[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Cc: "Eric W. Biederman" <ebied...@xmission.com>
Cc: James Bottomley <james.bottom...@hansenpartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com>
Cc: "W. Trevor King" <wk...@tremily.us>
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hal...@canonical.com>

--
2.5.5





--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

man-pages-4.07 is released

2016-07-18 Thread Michael Kerrisk (man-pages)




Gidday,

The Linux man-pages maintainer proudly announces:

man-pages-4.07 - man pages for Linux

This release includes input and contributions from
around 50 people. Over 140 pages saw changes, ranging
from typo fixes through to page rewrites and 4 newly
created pages.

Tarball download:
http://www.kernel.org/doc/man-pages/download.html
Git repository:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
http://man7.org/linux/man-pages/changelog.html#release_4.07

A short summary of the release is blogged at:
http://linux-man-pages.blogspot.com/2016/07/man-pages-407-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers on LKML is shown below.

Cheers,

Michael

 Changes in man-pages-4.07 

Released: 2016-07-17, Ulm


New and rewritten pages
-----------------------

ioctl_fideduperange.2
    Darrick J. Wong  [Christoph Hellwig, Michael Kerrisk]
        New page documenting the FIDEDUPERANGE ioctl
            Document the FIDEDUPERANGE ioctl, formerly known as
            BTRFS_IOC_EXTENT_SAME.

ioctl_ficlonerange.2
    Darrick J. Wong  [Christoph Hellwig, Michael Kerrisk]
        New page documenting FICLONE and FICLONERANGE ioctls
            Document the FICLONE and FICLONERANGE ioctls, formerly known as
            the BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls.

mount_namespaces.7
    Michael Kerrisk  [Michael Kerrisk]
        New page describing mount namespaces

Newly documented interfaces in existing pages
---------------------------------------------

mount.2
    Michael Kerrisk
        Document flags used to set propagation type
            Document MS_SHARED, MS_PRIVATE, MS_SLAVE, and MS_UNBINDABLE.
    Michael Kerrisk
        Document the MS_REC flag

ptrace.2
    Michael Kerrisk  [Kees Cook, Jann Horn, Eric W. Biederman, Stephen Smalley]
        Document ptrace access modes

proc.5
    Michael Kerrisk
        Document /proc/[pid]/timerslack_ns
    Michael Kerrisk
        Document /proc/PID/status 'Ngid' field
    Michael Kerrisk
        Document /proc/PID/status fields: 'NStgid', 'NSpid', 'NSpgid', 'NSsid'
    Michael Kerrisk
        Document /proc/PID/status 'Umask' field


Changes to individual pages
---------------------------

ldd.1
    Michael Kerrisk
        Add a little more detail on why ldd is unsafe with untrusted executables

futex.2
    Michael Kerrisk
        Correct an ENOSYS error description
            Since Linux 4.5, FUTEX_CLOCK_REALTIME is allowed with FUTEX_WAIT.
    Michael Kerrisk  [Darren Hart]
        Remove crufty text about FUTEX_WAIT_BITSET interpretation of timeout
            Since Linux 4.5, FUTEX_WAIT also understands
            FUTEX_CLOCK_REALTIME.
    Michael Kerrisk  [Thomas Gleixner]
        Explain how to get equivalent of FUTEX_WAIT with an absolute timeout
    Michael Kerrisk
        Describe FUTEX_BITSET_MATCH_ANY
            Describe FUTEX_BITSET_MATCH_ANY and FUTEX_WAIT and FUTEX_WAKE
            equivalences.
    Michael Kerrisk  [Thomas Gleixner, Darren Hart]
        Fix descriptions of various timeouts
    Michael Kerrisk
        Clarify clock default and choices for FUTEX_WAIT

kcmp.2
    Michael Kerrisk
        kcmp() is governed by PTRACE_MODE_READ_REALCREDS

mount.2
    Michael Kerrisk
        Restructure discussion of 'mountflags' into functional groups
            The existing text makes no differentiation between different
            "classes" of mount flags. However, certain flags such as
            MS_REMOUNT, MS_BIND, MS_MOVE, etc. determine the general
            type of operation that mount() performs. Furthermore, the
            choice of which class of operation to perform is performed in
            a certain order, and that order is significant if multiple
            flags are specified. Restructure and extend the text to
            reflect these details.
    Michael Kerrisk
        Since Linux 2.6.26, bind mounts can be made read-only

process_vm_readv.2
    Michael Kerrisk
        Rephrase permission rules in terms of a ptrace access mode check

ptrace.2
    Michael Kerrisk  [Jann Horn]
        Update Yama ptrace_scope documentation
            Reframe the discussion in terms of PTRACE_MODE_ATTACH checks,
            and make a few other minor tweaks and additions.
    Michael Kerrisk, Jann Horn
        Note that user namespaces can be used to bypass Yama protections
    Michael Kerrisk
        Note that PTRACE_SEIZE is subject to a ptrace access mode check
    Michael Kerrisk
        Rephrase PTRACE_ATTACH permissions in terms of ptrace access mode check

wait.2
    Michael Kerrisk
        Since Linux 4.7, __WALL is implied if child being ptraced
    Michael Kerrisk
        waitid() now (since Linux 4.7) also supports __WNOTHREAD/__WCLONE/__WALL

proc.5
    Michael Kerrisk
        /proc/PID/fd/* ar

Re: Bugzilla spam

2016-07-13 Thread Michael Kerrisk (man-pages)

Hello Konstantin,

On 13 July 2016 at 20:37, Konstantin Ryabitsev <mri...@kernel.org> wrote:
> On Wed, Jul 13, 2016 at 08:28:18PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Konstantin,
>>
>> The man-pages Bugzilla component (as well as other components on
>> Bugzilla by the look of things) is receiving vast quantities of spam.
>> What can be done about this? (Just marking the bugs private and
> >> closing isn't workable. There are just too many bugs coming in...).
>
> Not much can be done. :( Bugzilla's default spam-fighting capabilities
> are abysmal -- I can't even delete any accounts without installing
> multiple extensions. I'm actively investigating what we can do to
> improve the situation and will follow up shortly.

Okay, thanks. In the meantime, is it possible for you to lock the
man-pages component so that no further bug reports can be made via
that component?

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Bugzilla spam

2016-07-13 Thread Michael Kerrisk (man-pages)

Hello Konstantin,

The man-pages Bugzilla component (as well as other components on
Bugzilla by the look of things) is receiving vast quantities of spam.
What can be done about this? (Just marking the bugs private and
closing isn't workable. There are just too many bugs coming in...).

Thanks

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Michael Kerrisk (man-pages)


On 07/08/2016 05:26 AM, James Bottomley wrote:

On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:

On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:

On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:

On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages)
wrote:

On 7 July 2016 at 17:01, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm.  Probably best-effort based on the process hierarchy.  So
yeah you could probably get a tree into a state that would be
wrongly recreated. Create a new netns, bind mount it, exit;  Have
another task create a new user_ns, bind mount it, exit;  Third
task setns()s first to the new netns then to the new user_ns.  I
suspect criu will recreate that wrongly.


This is a bit pathological, and you have to be root to do it: so
root can set up a nesting hierarchy, bind it and destroy the pids
but I know of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I
have a permanent handle to enter the namespace by, so I suspect
that when our current orchestration systems get more sophisticated,
they might eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the
problem is that although each namespace has a parent user_ns,
there's no way to get it without digging in the namespace specific
structure.  Probably we should restructure to move it into
ns_common, then we could display it (and enforce all namespaces
having owning user_ns) but it would be a


I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?


Um, yes, I don't believe I said they don't.  The problem I thought you
were having is that there's no way of seeing what it is.

nsfs is the Namespace filesystem where bound namespaces appear to a cat
of /proc/self/mounts.  It can display any information that's in
ns_common (the common core of namespaces) but the owning user_ns
pointer currently isn't in this structure.  Every user namespace has a
pointer to it, but they're all privately embedded in the individual
namespace specific structures.  What I was proposing was that since
every current namespace has a pointer somewhere to the owning user
namespace, we could abstract this out into ns_common so it's now
accessible to be displayed by nsfs, probably as a mount option.


James, I am not sure that I understood you correctly. We have one
file system for all namespace files, so how can we show per-file
properties in mount options? I think we can show all required
information in fdinfo. We open a namespace file (/proc/pid/ns/N) and
then read /proc/pid/fdinfo/X for it.


Here is a proof-of-concept patch.

How it works:

In [1]: import os

In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)

In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
pos:0
flags:  010
mnt_id: 2
userns: 4026531837

In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
/proc/self/ns/user -> user:[4026531837]


can't you just do

readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'

?
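
For readers who would rather stay in Python than pipe through sed, the
same extraction can be sketched as below. This is only an illustration:
the helper names parse_ns_link() and ns_inode() are made up for this
example, and the only thing assumed about the kernel is the
"type:[inode]" link format shown in the session above.

```python
import os
import re

# Matches namespace link targets such as "user:[4026531837]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/PID/ns/* link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link target: %r" % (target,))
    return match.group("kind"), int(match.group("inode"))


def ns_inode(pid, kind):
    """Return the inode number of one namespace of a process (Linux only).

    `pid` may be a number or the string "self"; `kind` is a file name
    under /proc/PID/ns/, for example "user" or "pid".
    """
    return parse_ns_link(os.readlink("/proc/%s/ns/%s" % (pid, kind)))[1]
```

As with the sed one-liner, this only identifies the namespace a process
is in; it says nothing about which user namespace owns it.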

But what Michael was asking about was the parent user_ns of all the
other namespaces ...


Just to reiterate, what I'm interested in is the introspection use
case (but there's clearly several other interesting use cases here).
The idea is to be able to answer these questions:

1. For each userns, what is the parent of that userns?

2. For each non-user namespace, what is the owning userns?

This enables us to understand the userns hierarchy, which
matters in terms of answering the question: what capabilities
does process X have in namespace Y?
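
To make the use of those two answers concrete, here is a rough sketch
of the lookup an introspection tool could do once both relationships
are known. Everything in it is an assumption for illustration: no
interface exposed this data at the time of this thread, so the
parent_of and owner_of dictionaries and the namespace names are
hypothetical. The check is only a necessary condition; it ignores
which capabilities the process actually holds in its own userns.

```python
def is_same_or_ancestor(parent_of, candidate, userns):
    """Return True if `candidate` is `userns` or one of its ancestors.

    `parent_of` maps a user namespace ID to its parent's ID; the
    initial user namespace maps to None.
    """
    current = userns
    while current is not None:
        if current == candidate:
            return True
        current = parent_of[current]
    return False


def could_have_capabilities(parent_of, owner_of, process_userns, target_ns):
    """Necessary condition for a process to hold capabilities over a
    non-user namespace: the process's user namespace must be the owning
    user namespace of the target, or an ancestor of it.
    """
    return is_same_or_ancestor(parent_of, process_userns, owner_of[target_ns])


# Hypothetical hierarchy: init -> A -> B, and init -> C.
parent_of = {"init": None, "A": "init", "B": "A", "C": "init"}
# Hypothetical network namespace "net1", owned by user namespace B.
owner_of = {"net1": "B"}

print(could_have_capabilities(parent_of, owner_of, "A", "net1"))  # True
print(could_have_capabilities(parent_of, owner_of, "C", "net1"))  # False
```

A process in C is a sibling branch, so it can never be privileged over
net1, whereas a process in A or init might be. Without questions 1 and
2 answered, neither dictionary can be filled in.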
   
Cheers,


Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread Michael Kerrisk (man-pages)


On 07/07/2016 09:17 PM, James Bottomley wrote:

On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:

On 7 July 2016 at 17:01, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm.  Probably best-effort based on the process hierarchy.  So
yeah you could probably get a tree into a state that would be
wrongly recreated. Create a new netns, bind mount it, exit;  Have
another task create a new user_ns, bind mount it, exit;  Third
task setns()s first to the new netns then to the new user_ns.  I
suspect criu will recreate that wrongly.


This is a bit pathological, and you have to be root to do it: so
root can set up a nesting hierarchy, bind it and destroy the pids
but I know of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I
have a permanent handle to enter the namespace by, so I suspect
that when our current orchestration systems get more sophisticated,
they might eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the
problem is that although each namespace has a parent user_ns,
there's no way to get it without digging in the namespace specific
structure.  Probably we should restructure to move it into
ns_common, then we could display it (and enforce all namespaces
having owning user_ns) but it would be a


I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?


Um, yes, I don't believe I said they don't.  The problem I thought you
were having is that there's no way of seeing what it is.


Your words "and enforce all namespaces having owning user_ns" were
what left me puzzled--it sounded to me that the implication was
that this is not "enforced" right now.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread Michael Kerrisk (man-pages)

On 7 July 2016 at 17:01, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
>> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
>> > Hi Serge,
>> >
>> > On 6 July 2016 at 16:13, Serge E. Hallyn <se...@hallyn.com> wrote:
>> > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man
>> > > -pages) wrote:
>> > > > [Rats! Doing now what I should have done to start with. Looping
>> > > > some lists and CRIU and other possibly relevant people into
>> > > > this conversation]
>> > > >
>> > > > Hi Eric,
>> > > >
>> > > > On 5 July 2016 at 23:47, Eric W. Biederman <
>> > > > ebied...@xmission.com> wrote:
>> > > > > "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com>
>> > > > > writes:
>> > > > >
>> > > > > > Hi Eric,
>> > > > > >
>> > > > > > I have a question. Is there any way currently to discover
>> > > > > > which user namespace a particular nonuser namespace is
>> > > > > > governed by? Maybe I am missing something, but there does
>> > > > > > not seem to be a way to do this. Also, can one discover
>> > > > > > which userns is the parent of a given userns? Again, I
>> > > > > > can't see a way to do this.
>> > > > > >
>> > > > > > The point here is introspecting so that a process might
>> > > > > > determine what its capabilities are when operating on some
>> > > > > > resource governed by a (nonuser) namespace.
>> > > > >
>> > > > > To the best of my knowledge there is not an interface to
>> > > > > get that information.  It would be good to have such an
>> > > > > interface for no other reason than the CRIU folks are going
>> > > > > to need it at some point.  I am a bit surprised they have not
>> > > > > complained yet.
>> > >
>> > > I don't think they need it.  They do in fact have what they need.
>> > >   Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in
>> > > init_user_ns;  T1 spawned T1_1 in a new userns;  T2 spawned T2_1
>> > > which setns()d to T1_1's ns. There's some {handwave} uid mapping,
>> > > does not matter.
>> > >
>> > > At restart, it doesn't matter which task originally created the
>> > > new userns. criu knows T1_1 and T2_1 are in the same userns;  it
>> > > creates the userns, sets up the mapping, and T1_1 and T2_1
>> > > setns() to it.
>> >
>> > I'm missing something here. How do the parental relationships
>> > between the user namespaces get reconstructed? Those relationships
>> > will govern what capabilities a process will have in various user
>> > namespaces.
>
> Actually, you get the parent namespace from the process tree by
> tracking the user namespaces of the parent pids.   Currently non-root
> users can't bind the namespace, so the only way to keep a new user_ns
> around if you're not root is to keep the process around, so for
> multiply nested user namespaces you can usually build the user_ns
> hierarchy by looking at the process hierarchy.  Conversely, if the
> process is reparented to init, chances are that the user_ns is also
> parented to init_user_ns.

Yes, but "chances are" == this isn't robust.  PR_SET_CHILD_SUBREAPER
further complicates things.
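
A rough sketch of the process-tree heuristic, to show where it goes
wrong. The function name and the input format (a dictionary mapping
each pid to its parent pid and a userns identifier) are invented for
this example; a real tool would have to collect that data from /proc.

```python
def infer_userns_parents(processes):
    """Guess each user namespace's parent from the process tree.

    `processes` maps pid -> (ppid, userns_id). For each process, walk
    up the parent-pid chain to the nearest ancestor process that lives
    in a different user namespace, and record that namespace as the
    parent. The first guess for a namespace wins; this is best-effort.
    """
    parents = {}
    for pid, (ppid, userns) in processes.items():
        ancestor = ppid
        while ancestor in processes:
            ancestor_ppid, ancestor_userns = processes[ancestor]
            if ancestor_userns != userns:
                parents.setdefault(userns, ancestor_userns)
                break
            ancestor = ancestor_ppid
    return parents


# Process tree intact: 10 (init userns) -> 11 (userns A) -> 12 (userns B).
intact = {1: (0, "init"), 10: (1, "init"), 11: (10, "A"), 12: (11, "B")}
print(infer_userns_parents(intact))    # {'A': 'init', 'B': 'A'}

# Process 11 has exited and 12 was reparented to pid 1.
orphaned = {1: (0, "init"), 10: (1, "init"), 12: (1, "B")}
print(infer_userns_parents(orphaned))  # {'B': 'init'}
```

In the second case the heuristic reports init as the parent of B even
though B was created from A, and A no longer shows up at all.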

By the way, is that really what happens? Do child user namespaces get
reparented to the grandparent ns if the parent ns disappears (i.e.,
ceases to have any members and no bind mounts)? I hadn't thought about
that scenario before. It may be worth documenting in
user_namespaces(7).

>> Hm.  Probably best-effort based on the process hierarchy.  So yeah
>> you could probably get a tree into a state that would be wrongly
>> recreated. Create a new netns, bind mount it, exit;  Have another
>> task create a new user_ns, bind mount it, exit;  Third task setns()s
>> first to the new netns then to the new user_ns.  I suspect criu will
>> recreate that wrongly.
>
> This is a bit pathological, and you have to be root to do it: so root
> can set up a nesting hierarchy, bind it and destroy the pids but I know
> of no current orchestration system which does this.
>
> Actually, I have to back pedal a bit: the way I currently set up
> architec

Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread Michael Kerrisk (man-pages)

Hi Serge,

On 6 July 2016 at 16:13, Serge E. Hallyn <se...@hallyn.com> wrote:
> On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
>> [Rats! Doing now what I should have done to start with. Looping some
>> lists and CRIU and other possibly relevant people into this
>> conversation]
>>
>> Hi Eric,
>>
>> On 5 July 2016 at 23:47, Eric W. Biederman <ebied...@xmission.com> wrote:
>> > "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:
>> >
>> >> Hi Eric,
>> >>
>> >> I have a question. Is there any way currently to discover which
>> >> user namespace a particular nonuser namespace is governed by?
>> >> Maybe I am missing something, but there does not seem to be a
>> >> way to do this. Also, can one discover which userns is the
>> >> parent of a given userns? Again, I can't see a way to do this.
>> >>
>> >> The point here is introspecting so that a process might determine
>> >> what its capabilities are when operating on some resource governed
>> >> by a (nonuser) namespace.
>> >
>> > To the best of my knowledge there is not an interface to get that
>> > information.  It would be good to have such an interface for no other
>> > reason than the CRIU folks are going to need it at some point.  I am a
>> > bit surprised they have not complained yet.
>
> I don't think they need it.  They do in fact have what they need.  Assume
> you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
> spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
> There's some {handwave} uid mapping, does not matter.
>
> At restart, it doesn't matter which task originally created the new userns.
> criu knows T1_1 and T2_1 are in the same userns;  it creates the userns, sets
> up the mapping, and T1_1 and T2_1 setns() to it.

I'm missing something here. How do the parental relationships
between the user namespaces get reconstructed? Those relationships
will govern what capabilities a process will have in various user
namespaces.
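
The part that is already possible can be sketched as follows: finding
which tasks share a user namespace by comparing the device and inode
numbers that stat() reports for /proc/PID/ns/user. The helper names
and the sample numbers are made up for this example. What the sketch
cannot produce is any link between two of the resulting groups.

```python
import os
from collections import defaultdict


def ns_identity(pid, kind="user"):
    """Return (st_dev, st_ino) identifying a process's namespace.

    Linux only. Two processes are in the same namespace when both
    values match.
    """
    st = os.stat("/proc/%s/ns/%s" % (pid, kind))
    return st.st_dev, st.st_ino


def group_by_namespace(identities):
    """Group pids by namespace identity.

    `identities` maps pid -> (st_dev, st_ino), for example as collected
    with ns_identity(). Returns a dict from identity to a sorted list
    of pids.
    """
    groups = defaultdict(list)
    for pid, ident in sorted(identities.items()):
        groups[ident].append(pid)
    return dict(groups)


# Hypothetical data: T1_1 (pid 101) and T2_1 (pid 102) share a userns,
# while pid 103 is in a different one.
sample = {101: (4, 4026532001), 102: (4, 4026532001), 103: (4, 4026531837)}
print(group_by_namespace(sample))
```

The output tells a restore tool which tasks to setns() into the same
userns, but not which userns to create the new one from.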

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Introspecting userns relationships to other namespaces?

2016-07-06 Thread Michael Kerrisk (man-pages)

[Rats! Doing now what I should have done to start with. Looping some
lists and CRIU and other possibly relevant people into this
conversation]

Hi Eric,

On 5 July 2016 at 23:47, Eric W. Biederman <ebied...@xmission.com> wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:
>
>> Hi Eric,
>>
>> I have a question. Is there any way currently to discover which
>> user namespace a particular nonuser namespace is governed by?
>> Maybe I am missing something, but there does not seem to be a
>> way to do this. Also, can one discover which userns is the
>> parent of a given userns? Again, I can't see a way to do this.
>>
>> The point here is introspecting so that a process might determine
>> what its capabilities are when operating on some resource governed
>> by a (nonuser) namespace.
>
> To the best of my knowledge there is not an interface to get that
> information.  It would be good to have such an interface for no other
> reason than the CRIU folks are going to need it at some point.  I am a
> bit surprised they have not complained yet.
>
> That said in a normal use scenario I don't think that information is
> needed.
>
> Do you have a particular use case besides checkpoint/restart where this
> is useful?  That might help in coming up with a good userspace interface
> for this information.

So, I spend a moderate amount of time working with people to introduce
them to the namespaces infrastructure, and one topic that comes up now
and then is introspection/visualization tools. For example,
nowadays--thanks to the (bizarrely misnamed) NStgid and NSpid fields
in /proc/PID/status--it's possible to (and someone I was working with
did) write tools that introspect the PID namespace hierarchy to show all
of the processes and their PIDs in the various namespace instances. It's
a natural enough thing to want to do, when confronted with the
complexity of the namespaces.
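
As an illustration of the kind of tool meant here, the following is a
minimal sketch of reading those fields. The function names and the sample
text are made up for the example; the field layout (one ID per nested PID
namespace, outermost first) follows the description in proc(5).

```python
def parse_ns_ids(status_text: str) -> dict:
    """Extract the NStgid and NSpid lists from /proc/PID/status content.

    Each list runs from the outermost visible PID namespace to the
    innermost one that the process belongs to.
    """
    ids = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("NStgid", "NSpid"):
            ids[key] = [int(field) for field in value.split()]
    return ids


def read_ns_ids(pid="self") -> dict:
    """Read and parse /proc/<pid>/status for the given PID."""
    with open(f"/proc/{pid}/status") as f:
        return parse_ns_ids(f.read())


# Made-up status fragment for a process nested two PID namespaces deep.
sample = "Name:\tsleep\nNStgid:\t4321\t17\t1\nNSpid:\t4321\t17\t1\n"
print(parse_ns_ids(sample))
```

A visualization tool would call something like `read_ns_ids()` for every
PID under /proc and arrange the processes by the length and content of
these lists.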

Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespace types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations
on a resource governed by namespace Y?
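
For contrast, here is a sketch of what can already be discovered from user
space: which processes share a namespace, by comparing the device and inode
numbers of the /proc/PID/ns/* files. The helper names are hypothetical, and
the `stat` parameter exists only so the function can be exercised without a
live /proc. Note that this reveals membership only; it says nothing about
which user namespace is the parent of another, or which one owns a given
nonuser namespace.

```python
import os
from collections import defaultdict


def group_by_namespace(pids, ns_type="user", stat=os.stat):
    """Group PIDs by the identity of their /proc/PID/ns/<ns_type> namespace.

    Two processes are in the same namespace when the namespace files have
    the same device and inode numbers.
    """
    groups = defaultdict(list)
    for pid in pids:
        try:
            st = stat(f"/proc/{pid}/ns/{ns_type}")
        except OSError:
            continue  # process exited, or we lack permission to look
        groups[(st.st_dev, st.st_ino)].append(pid)
    return dict(groups)


def all_pids():
    """List the numeric entries of /proc, i.e. the live PIDs."""
    return [int(name) for name in os.listdir("/proc") if name.isdigit()]
```

Calling `group_by_namespace(all_pids())` yields one group per user
namespace that has visible member processes.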

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Review of ptrace Yama ptrace_scope description

2016-06-28 Thread Michael Kerrisk (man-pages)


Hi Kees,

On 06/28/2016 10:55 PM, Kees Cook wrote:

On Mon, Jun 27, 2016 at 11:11 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

Hi Jann,


On 06/25/2016 04:30 PM, Jann Horn wrote:


On Sat, Jun 25, 2016 at 09:30:43AM +0200, Michael Kerrisk (man-pages)
wrote:


Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe
the Yama ptrace_scope file. I don't think I asked you for review
at the time, but in the light of other changes to the ptrace(2)
page, it occurred to me that it might be a good idea to ask you
to check the text below to see if anything is missing or could be
improved. Might you have a moment for that?

   /proc/sys/kernel/yama/ptrace_scope
   On systems with the Yama Linux Security Module (LSM)  installed
   (i.e.,  the  kernel  was configured with CONFIG_SECURITY_YAMA),
   the /proc/sys/kernel/yama/ptrace_scope  file  (available  since
   Linux  3.4)  can  be  used  to  restrict the ability to trace a
   process with ptrace(2) (and thus also the ability to use  tools
   such  as  strace(1) and gdb(1)).  The goal of such restrictions
   is to prevent attack escalation whereby a  compromised  process
   can  ptrace-attach  to  other  sensitive processes (e.g., a GPG
   agent or an SSH session) owned by the user  in  order  to  gain
   additional credentials and thus expand the scope of the attack.


Maybe clarify "additional credentials that may exist in memory only and thus..."


Done.
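
For readers who want to check the setting discussed in this text on their
own system, a minimal sketch follows. The names for values 0 to 2 are the
ones used in the page text; the name for value 3 comes from the kernel's
Yama documentation rather than from the excerpt quoted here. The function
names are hypothetical.

```python
PTRACE_SCOPE_PATH = "/proc/sys/kernel/yama/ptrace_scope"

# Names for values 0-2 are those used in the text under review; the name
# for value 3 is taken from the kernel's Yama documentation.
SCOPE_NAMES = {
    0: "classic ptrace permissions",
    1: "restricted ptrace",
    2: "admin-only attach",
    3: "no attach",
}


def describe_scope(raw: str) -> str:
    """Map the file's content (a number plus newline) to its name."""
    value = int(raw.strip())
    return SCOPE_NAMES.get(value, f"unknown value {value}")


def current_scope():
    """Return the current setting, or None if Yama is not available."""
    try:
        with open(PTRACE_SCOPE_PATH) as f:
            return describe_scope(f.read())
    except FileNotFoundError:
        return None
```

`current_scope()` returns None on kernels built without
CONFIG_SECURITY_YAMA, since the file does not exist there.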



   More precisely, the Yama LSM limits two types of operations:

   *  Any   operation   that   performs   a   ptrace  access  mode
  PTRACE_MODE_ATTACH check—for  example,  ptrace()
  PTRACE_ATTACH.   (See the "Ptrace access mode checking" dis‐
  cussion above.)

   *  ptrace() PTRACE_TRACEME.

   A process that has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the follow‐
   ing values:

   0 ("classic ptrace permissions")
  No additional restrictions on  operations  that  perform
  PTRACE_MODE_ATTACH  checks  (beyond those imposed by the
  commoncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When   performing   an   operation   that   requires   a
  PTRACE_MODE_ATTACH  check, the calling process must have
  a predefined relationship with the target  process.   By
  default,  the predefined relationship is that the target
  process must be a child of the caller.

  A target process can employ the prctl(2)  PR_SET_PTRACER
  operation  to declare a different PID that is allowed to
  perform PTRACE_MODE_ATTACH  operations  on  the  target.
  See   the   kernel   source   file   Documentation/secu‐
  rity/Yama.txt for further details.

  The use of PTRACE_TRACEME is unchanged.



(namespaced) CAP_SYS_PTRACE is also sufficient here.


Both here and in the "admin-only attach" case, it is IMO important to
note that creating a user namespace effectively removes the Yama
protection because the owner of a namespace, when accessing its
contents from outside, is relatively capable.

This means that when a process tries to use namespaces to sandbox
itself, it inadvertently makes itself more accessible.

(This could probably be worked around in the kernel, but such a
workaround would likely not be default, but rather opt-in via a new
flag for clone() and unshare() or so.)



Thanks for catching this!

So I've made that section of text:

   A  process  that  has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the  following
   values:

   0 ("classic ptrace permissions")
  No  additional  restrictions  on  operations  that  perform
  PTRACE_MODE_ATTACH checks (beyond those imposed by the com‐
  moncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When  performing  an  operation   that   requires   a
  PTRACE_MODE_ATTACH check, the calling process  must  either
  have the CAP_SYS_PTRACE capability in the user namespace of
  the target process or it  have  a  predefined  relationship
  with  the target process.  By default, the predefined rela‐
  tionship is that the target process must be a child of  the
  caller.


More accurately, must be a descendant of the caller (grand child is fine, etc).


Thanks, Fixed.
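
The "descendant" relationship can be approximated from user space as
sketched below, by following PPid links in /proc/PID/status. This is only
an illustration of the relationship, not the kernel's check: it is racy,
and the function names and the injectable `ppid_of` parameter are
assumptions made for the example.

```python
def read_ppid(pid: int) -> int:
    """Return the parent PID from the PPid field of /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("PPid:"):
                return int(line.split()[1])
    raise ValueError(f"no PPid field for PID {pid}")


def is_descendant(target: int, tracer: int, ppid_of=read_ppid) -> bool:
    """Return True if target is a child, grandchild, ... of tracer."""
    pid = target
    while pid > 1:
        pid = ppid_of(pid)
        if pid == tracer:
            return True
    return False
```

With a parent table such as `{30: 20, 20: 10, 10: 1}`, PID 30 counts as a
descendant of PID 10 even though it is a grandchild rather than a child.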





  A  target  process  can  employ the prctl(2) PR_SET_PTRACER
  operation to declare a different PID

Re: Review of ptrace Yama ptrace_scope description

2016-06-28 Thread Michael Kerrisk (man-pages)


Hi Jann,
...


So I've made that section of text:

   A  process  that  has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the  following
   values:

   0 ("classic ptrace permissions")
  No  additional  restrictions  on  operations  that  perform
  PTRACE_MODE_ATTACH checks (beyond those imposed by the com‐
  moncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When  performing  an  operation   that   requires   a
  PTRACE_MODE_ATTACH check, the calling process  must  either
  have the CAP_SYS_PTRACE capability in the user namespace of
  the target process or it  have  a  predefined  relationship
  with  the target process.


Nit: The grammar in this sentence seems wrong to me.
s/or it have/or it must have/?


Yep, thanks for catching that. Fixed now.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Review of ptrace Yama ptrace_scope description

2016-06-28 Thread Michael Kerrisk (man-pages)


Hi Jann,

On 06/25/2016 04:30 PM, Jann Horn wrote:

On Sat, Jun 25, 2016 at 09:30:43AM +0200, Michael Kerrisk (man-pages) wrote:

Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe
the Yama ptrace_scope file. I don't think I asked you for review
at the time, but in the light of other changes to the ptrace(2)
page, it occurred to me that it might be a good idea to ask you
to check the text below to see if anything is missing or could be
improved. Might you have a moment for that?

   /proc/sys/kernel/yama/ptrace_scope
   On systems with the Yama Linux Security Module (LSM)  installed
   (i.e.,  the  kernel  was configured with CONFIG_SECURITY_YAMA),
   the /proc/sys/kernel/yama/ptrace_scope  file  (available  since
   Linux  3.4)  can  be  used  to  restrict the ability to trace a
   process with ptrace(2) (and thus also the ability to use  tools
   such  as  strace(1) and gdb(1)).  The goal of such restrictions
   is to prevent attack escalation whereby a  compromised  process
   can  ptrace-attach  to  other  sensitive processes (e.g., a GPG
   agent or an SSH session) owned by the user  in  order  to  gain
   additional credentials and thus expand the scope of the attack.

   More precisely, the Yama LSM limits two types of operations:

   *  Any   operation   that   performs   a   ptrace  access  mode
  PTRACE_MODE_ATTACH check—for  example,  ptrace()
  PTRACE_ATTACH.   (See the "Ptrace access mode checking" dis‐
  cussion above.)

   *  ptrace() PTRACE_TRACEME.

   A process that has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the follow‐
   ing values:

   0 ("classic ptrace permissions")
  No additional restrictions on  operations  that  perform
  PTRACE_MODE_ATTACH  checks  (beyond those imposed by the
  commoncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When   performing   an   operation   that   requires   a
  PTRACE_MODE_ATTACH  check, the calling process must have
  a predefined relationship with the target  process.   By
  default,  the predefined relationship is that the target
  process must be a child of the caller.

  A target process can employ the prctl(2)  PR_SET_PTRACER
  operation  to declare a different PID that is allowed to
  perform PTRACE_MODE_ATTACH  operations  on  the  target.
  See   the   kernel   source   file   Documentation/secu‐
  rity/Yama.txt for further details.

  The use of PTRACE_TRACEME is unchanged.


(namespaced) CAP_SYS_PTRACE is also sufficient here.


Both here and in the "admin-only attach" case, it is IMO important to
note that creating a user namespace effectively removes the Yama
protection because the owner of a namespace, when accessing its
contents from outside, is relatively capable.

This means that when a process tries to use namespaces to sandbox
itself, it inadvertently makes itself more accessible.

(This could probably be worked around in the kernel, but such a
workaround would likely not be default, but rather opt-in via a new
flag for clone() and unshare() or so.)


Thanks for catching this!

So I've made that section of text:

   A  process  that  has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the  following
   values:

   0 ("classic ptrace permissions")
  No  additional  restrictions  on  operations  that  perform
  PTRACE_MODE_ATTACH checks (beyond those imposed by the com‐
  moncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When  performing  an  operation   that   requires   a
  PTRACE_MODE_ATTACH check, the calling process  must  either
  have the CAP_SYS_PTRACE capability in the user namespace of
  the target process or it  have  a  predefined  relationship
  with  the target process.  By default, the predefined rela‐
  tionship is that the target process must be a child of  the
  caller.

  A  target  process  can  employ the prctl(2) PR_SET_PTRACER
  operation to declare a different PID  that  is  allowed  to
  perform  PTRACE_MODE_ATTACH  operations on the target.  See
  the kernel source file Documentation/security/Yama.txt  for
  further details.

  The use of PTRACE_TRACEME is unchanged.

   2 ("admin-only attach")
  Only  processes  with  the CAP_SYS_PTRACE capability

Re: [PATCH v2 2/2] namespaces: add transparent user namespaces

2016-06-26 Thread Michael Kerrisk

Hi Jann,

Patches such as this really should CC linux-api@ (added).

On Sat, Jun 25, 2016 at 2:23 AM, Jann Horn  wrote:
> This allows the admin of a user namespace to mark the namespace as
> transparent. All other namespaces, by default, are opaque.
>
> While the current behavior of user namespaces is appropriate for use in
> containers, there are many programs that only use user namespaces because
> doing so enables them to do other things (e.g. unsharing the mount or
> network namespace) that require namespaced capabilities. For them, the
> inability to see the real UIDs and GIDs of things from inside the user
> namespace can be very annoying.
>
> In a transparent namespace, all UIDs and GIDs that are mapped into its
> first opaque ancestor are visible and are not remapped. This means that if
> a process e.g. stat()s the real root directory in a namespace, it will
> still see it as owned by UID 0.
>
> Traditionally, any UID or GID that was visible in a user namespace was also
> mapped into the namespace, giving the namespace admin full access to it.
> This patch introduces a distinction: In a transparent namespace, UIDs and
> GIDs can be visible without being mapped. Non-mapped, visible UIDs can be
> passed from the kernel to userspace, but userspace can't send them back to
> the kernel.

Can you explain "can't send them back to the kernel" in more detail?
(Some examples of what is and isn't possible would be helpful.)
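
As a baseline for such examples, here is a minimal sketch of how ID
translation behaves today in an ordinary (opaque) user namespace, based on
the uid_map format described in user_namespaces(7). It does not model the
proposed transparent behavior. The function names are hypothetical, and
65534 is assumed as the overflow ID (the usual default of
/proc/sys/kernel/overflowuid).

```python
OVERFLOW_UID = 65534  # assumed default of /proc/sys/kernel/overflowuid


def parse_id_map(text: str):
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]


def to_inside(outside_id: int, id_map) -> int:
    """Translate an ID from the outer namespace into the inner one."""
    for inside, outside, count in id_map:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return OVERFLOW_UID  # unmapped IDs are reported as the overflow ID


# Map as typically written by an unprivileged user with UID 1000.
id_map = parse_id_map("0 1000 1\n")
print(to_inside(1000, id_map))  # 0
print(to_inside(0, id_map))     # 65534
```

With this single-entry map, a file owned by real UID 0 shows up inside the
namespace as owned by the overflow ID, which is the behavior the quoted
commit message describes as annoying.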

> In order to be able to fully use specific UIDs/GIDs and gain
> privileges over them, mappings need to be set up in the usual way -
> however, to avoid aliasing problems, only identity mappings are permitted.
>
> v2:
> Ensure that all relevant from_k[ug]id callers show up in the patch.
> _transparent would be more verbose than _tp, but considering the line
> length rule, that's just too long.
>
> Yes, this makes the patch rather large.
>
> Behavior should be the same as in v1, except that I'm not touching orangefs
> in this patch because every single use of from_k[ug]id in it is wrong in
> some way. (Thanks for making me reread all that stuff, Eric.) I'll write a
> separate patch or at least report the issue with more detail later.
>
> (Also, the handling of user namespaces when dealing with signals is
> super-ugly and kind of incorrect. That should probably be cleaned up.)

I'm curious about this detail: can you say some more about the issues here?

> posix_acl_to_xattr would have changed behavior in the v1 patch, but isn't
> changed here. Because it's only used with init_user_ns, that won't change
> user-visible behavior relative to v1.
>
> This patch was compile-tested with allyesconfig. I also ran a VM with this
> patch applied and checked that it still works, but that probably doesn't
> mean much.

One of the things notably lacking from this commit message is any sort
of description of the user-space-API changes that it makes. I presume
it's a matter of some /proc files. Could you explain the changes (and
add that detail in any further commit message)?

Thanks,

Michael

> Signed-off-by: Jann Horn 
> ---
>  arch/alpha/kernel/osf_sys.c   |   4 +-
>  arch/arm/kernel/sys_oabi-compat.c |   4 +-
>  arch/ia64/kernel/signal.c |   4 +-
>  arch/s390/kernel/compat_linux.c   |  26 +++---
>  arch/sparc/kernel/sys_sparc32.c   |   4 +-
>  arch/x86/ia32/sys_ia32.c  |   4 +-
>  drivers/android/binder.c  |   2 +-
>  drivers/gpu/drm/drm_info.c|   2 +-
>  drivers/gpu/drm/drm_ioctl.c   |   2 +-
>  drivers/net/tun.c |   4 +-
>  fs/autofs4/dev-ioctl.c|   4 +-
>  fs/autofs4/waitq.c|   4 +-
>  fs/binfmt_elf.c   |  12 +--
>  fs/binfmt_elf_fdpic.c |  12 +--
>  fs/compat.c   |   4 +-
>  fs/fcntl.c|   4 +-
>  fs/ncpfs/ioctl.c  |  12 +--
>  fs/posix_acl.c|  11 ++-
>  fs/proc/array.c   |  18 ++--
>  fs/proc/base.c|  30 +--
>  fs/quota/kqid.c   |  12 ++-
>  fs/stat.c |  12 +--
>  include/linux/uidgid.h|  24 +++--
>  include/linux/user_namespace.h|   4 +
>  include/net/scm.h |   4 +-
>  ipc/mqueue.c  |   2 +-
>  ipc/msg.c |   8 +-
>  ipc/sem.c |   8 +-
>  ipc/shm.c |   8 +-
>  ipc/util.c|   8 +-
>  kernel/acct.c |   4 +-
>  kernel/exit.c |   6 +-
>  kernel/groups.c   |   2 +-
>  kernel/signal.c   |  16 ++--
>  kernel/sys.c  |  24 ++---
>  kernel/trace/trace.c  |   2 +-
>  kernel/tsacct.c   |   4 +-
>  kernel/uid16.c|  22 ++---
>  kernel/user.c |   1 +
>  kernel/user_namespace.c   | 178 
>

Re: [PATCH v2 2/2] namespaces: add transparent user namespaces

2016-06-26 Thread Michael Kerrisk

Hi Jann,

Patches such as this really should CC linux-api@ (added).

On Sat, Jun 25, 2016 at 2:23 AM, Jann Horn  wrote:
> This allows the admin of a user namespace to mark the namespace as
> transparent. All other namespaces, by default, are opaque.
>
> While the current behavior of user namespaces is appropriate for use in
> containers, there are many programs that only use user namespaces because
> doing so enables them to do other things (e.g. unsharing the mount or
> network namespace) that require namespaced capabilities. For them, the
> inability to see the real UIDs and GIDs of things from inside the user
> namespace can be very annoying.
>
> In a transparent namespace, all UIDs and GIDs that are mapped into its
> first opaque ancestor are visible and are not remapped. This means that if
> a process e.g. stat()s the real root directory in a namespace, it will
> still see it as owned by UID 0.
>
> Traditionally, any UID or GID that was visible in a user namespace was also
> mapped into the namespace, giving the namespace admin full access to it.
> This patch introduces a distinction: In a transparent namespace, UIDs and
> GIDs can be visible without being mapped. Non-mapped, visible UIDs can be
> passed from the kernel to userspace, but userspace can't send them back to
> the kernel.

Can you explain "can't send them back to the kernel" in more detail?
(Some examples of what is and isn't possible would be helpul.)

> In order to be able to fully use specific UIDs/GIDs and gain
> privileges over them, mappings need to be set up in the usual way -
> however, to avoid aliasing problems, only identity mappings are permitted.
>
> v2:
> Ensure that all relevant from_k[ug]id callers show up in the patch.
> _transparent would be more verbose than _tp, but considering the line
> length rule, that's just too long.
>
> Yes, this makes the patch rather large.
>
> Behavior should be the same as in v1, except that I'm not touching orangefs
> in this patch because every single use of from_k[ug]id in it is wrong in
> some way. (Thanks for making me reread all that stuff, Eric.) I'll write a
> separate patch or at least report the issue with more detail later.
>
> (Also, the handling of user namespaces when dealing with signals is
> super-ugly and kind of incorrect. That should probably be cleaned up.)

I'm curious about this detail: can you say some more about the issues here?

> posix_acl_to_xattr would have changed behavior in the v1 patch, but isn't
> changed here. Because it's only used with init_user_ns, that won't change
> user-visible behavior relative to v1.
>
> This patch was compile-tested with allyesconfig. I also ran a VM with this
> patch applied and checked that it still works, but that probably doesn't
> mean much.

One of the things notably lacking from this commit message is any sort
of description of the user-space-API changes that it makes. I presume
it's a matter of some /proc files. Could you explain the changes (and
add that detail in any further commit message)?

Thanks,

Michael

> Signed-off-by: Jann Horn 
> ---
>  arch/alpha/kernel/osf_sys.c   |   4 +-
>  arch/arm/kernel/sys_oabi-compat.c |   4 +-
>  arch/ia64/kernel/signal.c |   4 +-
>  arch/s390/kernel/compat_linux.c   |  26 +++---
>  arch/sparc/kernel/sys_sparc32.c   |   4 +-
>  arch/x86/ia32/sys_ia32.c  |   4 +-
>  drivers/android/binder.c  |   2 +-
>  drivers/gpu/drm/drm_info.c|   2 +-
>  drivers/gpu/drm/drm_ioctl.c   |   2 +-
>  drivers/net/tun.c |   4 +-
>  fs/autofs4/dev-ioctl.c|   4 +-
>  fs/autofs4/waitq.c|   4 +-
>  fs/binfmt_elf.c   |  12 +--
>  fs/binfmt_elf_fdpic.c |  12 +--
>  fs/compat.c   |   4 +-
>  fs/fcntl.c|   4 +-
>  fs/ncpfs/ioctl.c  |  12 +--
>  fs/posix_acl.c|  11 ++-
>  fs/proc/array.c   |  18 ++--
>  fs/proc/base.c|  30 +--
>  fs/quota/kqid.c   |  12 ++-
>  fs/stat.c |  12 +--
>  include/linux/uidgid.h|  24 +++--
>  include/linux/user_namespace.h|   4 +
>  include/net/scm.h |   4 +-
>  ipc/mqueue.c  |   2 +-
>  ipc/msg.c |   8 +-
>  ipc/sem.c |   8 +-
>  ipc/shm.c |   8 +-
>  ipc/util.c|   8 +-
>  kernel/acct.c |   4 +-
>  kernel/exit.c |   6 +-
>  kernel/groups.c   |   2 +-
>  kernel/signal.c   |  16 ++--
>  kernel/sys.c  |  24 ++---
>  kernel/trace/trace.c  |   2 +-
>  kernel/tsacct.c   |   4 +-
>  kernel/uid16.c|  22 ++---
>  kernel/user.c |   1 +
>  kernel/user_namespace.c   | 178 +++---
>

Review of ptrace Yama ptrace_scope description

2016-06-25 Thread Michael Kerrisk (man-pages)


Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe
the Yama ptrace_scope file. I don't think I asked you for review
at the time, but in the light of other changes to the ptrace(2)
page, it occurred to me that it might be a good idea to ask you
to check the text below to see if anything is missing or could be
improved. Might you have a moment for that?

   /proc/sys/kernel/yama/ptrace_scope
   On systems with the Yama Linux Security Module (LSM)  installed
   (i.e.,  the  kernel  was configured with CONFIG_SECURITY_YAMA),
   the /proc/sys/kernel/yama/ptrace_scope  file  (available  since
   Linux  3.4)  can  be  used  to  restrict the ability to trace a
   process with ptrace(2) (and thus also the ability to use  tools
   such  as  strace(1) and gdb(1)).  The goal of such restrictions
   is to prevent attack escalation whereby a  compromised  process
   can  ptrace-attach  to  other  sensitive processes (e.g., a GPG
   agent or an SSH session) owned by the user  in  order  to  gain
   additional credentials and thus expand the scope of the attack.

   More precisely, the Yama LSM limits two types of operations:

   *  Any   operation   that   performs   a   ptrace  access  mode
  PTRACE_MODE_ATTACH check—for  example,  ptrace()
  PTRACE_ATTACH.   (See the "Ptrace access mode checking" dis‐
  cussion above.)

   *  ptrace() PTRACE_TRACEME.

   A process that has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the follow‐
   ing values:

   0 ("classic ptrace permissions")
  No additional restrictions on  operations  that  perform
  PTRACE_MODE_ATTACH  checks  (beyond those imposed by the
  commoncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When   performing   an   operation   that   requires   a
  PTRACE_MODE_ATTACH  check, the calling process must have
  a predefined relationship with the target  process.   By
  default,  the predefined relationship is that the target
  process must be a child of the caller.

  A target process can employ the prctl(2)  PR_SET_PTRACER
  operation  to declare a different PID that is allowed to
  perform PTRACE_MODE_ATTACH  operations  on  the  target.
  See   the   kernel   source   file   Documentation/secu‐
  rity/Yama.txt for further details.

  The use of PTRACE_TRACEME is unchanged.

   2 ("admin-only attach")
  Only processes with the  CAP_SYS_PTRACE  capability  may
  perform  PTRACE_MODE_ATTACH operations or trace children
  that employ PTRACE_TRACEME.

   3 ("no attach")
  No process may perform PTRACE_MODE_ATTACH operations  or
  trace children that employ PTRACE_TRACEME.

  Once  this value has been written to the file, it cannot
      be changed.
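
In case it helps with the review, here is a small sketch of how a program might inspect the setting described above. The function names are my own invention. Only the file path and the four value labels come from the text.

```c
#include <stdio.h>

/*
 * Map a ptrace_scope value to the label used in the man-page text.
 * Returns NULL for values outside the documented range 0..3.
 */
const char *yama_scope_label(int scope)
{
    static const char *const labels[] = {
        "classic ptrace permissions",
        "restricted ptrace",
        "admin-only attach",
        "no attach",
    };

    if (scope < 0 || scope > 3)
        return NULL;
    return labels[scope];
}

/*
 * Read the integer stored in 'path' (normally
 * /proc/sys/kernel/yama/ptrace_scope).
 * Returns the value read, or -1 if the file cannot be opened or parsed,
 * for example on a kernel built without CONFIG_SECURITY_YAMA.
 */
int read_yama_scope(const char *path)
{
    FILE *fp = fopen(path, "r");
    int scope;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &scope) != 1)
        scope = -1;
    fclose(fp);
    return scope;
}
```

A caller would pass "/proc/sys/kernel/yama/ptrace_scope" to read_yama_scope() and feed the result to yama_scope_label(), treating -1 as "Yama not available".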

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Documenting ptrace access mode checking

2016-06-25 Thread Michael Kerrisk (man-pages)


On 06/24/2016 05:18 PM, Casey Schaufler wrote:



On 6/24/2016 1:40 AM, Michael Kerrisk (man-pages) wrote:

On 06/22/2016 11:11 PM, Kees Cook wrote:

On Wed, Jun 22, 2016 at 12:21 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

On 06/21/2016 10:55 PM, Jann Horn wrote:

On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages)
wrote:

   5.  The  kernel LSM security_ptrace_access_check() interface is
   invoked to see if ptrace access is permitted.  The  results
   depend on the LSM.  The implementation of this interface in
   the default LSM performs the following steps:



For people who are unaware of how the LSM API works, it might be good to
clarify that the commoncap LSM is *always* invoked; otherwise, it might
give the impression that using another LSM would replace it.



As we can see, I am one of those who are unaware of how the LSM API
works :-/.


(Also, are there other documents that refer to it as "default LSM"? I
think that that term is slightly confusing.)



No, that's a terminological confusion of my own making. Fixed now.

I changed this text to:

   Various parts of the kernel-user-space API (not just ptrace(2)
   operations) require so-called "ptrace access mode permissions",
   which are gated by any enabled Linux Security Modules (LSMs),
   such as SELinux, Yama, or Smack, and by the commoncap LSM
   (which is always invoked).  Prior to  Linux  2.6.27,  all  such
   checks  were  of a single type.  Since Linux 2.6.27, two access
   mode levels are distinguished:

BTW, can you point me at the piece(s) of kernel code that show that
"commoncap" is always invoked in addition to any other LSM that has
been installed?


It's not entirely obvious, but the bottom of security/commoncap.c shows:

#ifdef CONFIG_SECURITY

struct security_hook_list capability_hooks[] = {
	LSM_HOOK_INIT(capable, cap_capable),
	...
};

void __init capability_add_hooks(void)
{
	security_add_hooks(capability_hooks, ARRAY_SIZE(capability_hooks));
}

#endif

And security/security.c shows the initialization order of the LSMs:

int __init security_init(void)
{
	pr_info("Security Framework initialized\n");

	/*
	 * Load minor LSMs, with the capability module always first.
	 */
	capability_add_hooks();
	yama_add_hooks();
	loadpin_add_hooks();

	/*
	 * Load all the remaining security modules.
	 */
	do_security_initcalls();

	return 0;
}


So, I just want to check my understanding of a couple of points:

1. The commoncap LSM is invoked first, and if it denies access,
   then no further LSM is/needs to be called.


Yes. The LSM infrastructure is "bail on fail".



2. Is it the case that only one of the other LSMs (SELinux, Yama,
   AppArmor, etc.) is invoked, or can more than one be invoked.
   I thought only one is invoked, but perhaps I am out of date
   in my understanding.


All registered modules are invoked, but only one "major"
module can be registered. The "minor" modules show up in
security_init, while the majors come in via do_security_initcalls.

I am in the process of messing that all up with patches
allowing multiple major modules. Stay tuned.
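
To make the "bail on fail" ordering above concrete, here is a user-space toy model. It is not kernel code. The hook type, the toy_* functions, and the request values are all made up for illustration. Only the ordering (commoncap first, then minor modules, then the major module) and the stop-at-first-denial rule come from the discussion.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook type: 0 means "allow", a negative errno means "deny". */
typedef int (*access_hook)(int request);

/* Number of hooks that ran during the most recent toy_check() call. */
int toy_calls;

/* Stand-in for commoncap: denies negative requests. */
int toy_commoncap(int request)
{
    toy_calls++;
    return request < 0 ? -EPERM : 0;
}

/* Stand-in for a minor module such as Yama: denies request 13. */
int toy_minor(int request)
{
    toy_calls++;
    return request == 13 ? -EPERM : 0;
}

/* Stand-in for the single major module: allows everything. */
int toy_major(int request)
{
    (void)request;
    toy_calls++;
    return 0;
}

/* Call each hook in registration order and stop at the first denial. */
int call_hooks(const access_hook *hooks, size_t n, int request)
{
    for (size_t i = 0; i < n; i++) {
        int rc = hooks[i](request);
        if (rc != 0)
            return rc;  /* later hooks are never consulted */
    }
    return 0;
}

/* Run the toy stack in the order described: commoncap, minor, major. */
int toy_check(int request)
{
    static const access_hook stack[] = { toy_commoncap, toy_minor, toy_major };

    toy_calls = 0;
    return call_hooks(stack, sizeof(stack) / sizeof(stack[0]), request);
}
```

With this model, a request denied by the commoncap stand-in leaves toy_calls at 1, which is the behavior asked about in point 1.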


Thanks for the info, Casey.

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-24 Thread Michael Kerrisk (man-pages)


On 06/24/2016 11:52 AM, Thomas Gleixner wrote:

On Fri, 24 Jun 2016, Michael Kerrisk (man-pages) wrote:

By the way, I just realized something that wasn't initially obvious
to me, and documented it in the futex(2) man page:

  Note:  for  FUTEX_WAIT,  timeout is interpreted as a
  relative value.  This differs from other futex oper‐
  ations,  where timeout is interpreted as an absolute
  value.  To obtain the equivalent of FUTEX_WAIT  with
  an  absolute  timeout, employ FUTEX_WAIT_BITSET with
  val3 specified as FUTEX_BITSET_MATCH_ANY.

Okay?


Yes.


Thanks, Thomas.
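
For the record, a minimal sketch of the FUTEX_WAIT_BITSET approach from the note above. The wrapper names futex_wait_abs() and deadline_in_ms() are mine. I've added FUTEX_CLOCK_REALTIME so that the deadline is measured against CLOCK_REALTIME. Without that flag the absolute timeout is measured against CLOCK_MONOTONIC.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Block while *uaddr == expected, until woken or until the absolute
 * CLOCK_REALTIME time 'deadline' is reached.
 * Returns 0 if woken; otherwise -1 with errno set, for example
 * ETIMEDOUT (deadline passed) or EAGAIN (*uaddr != expected).
 */
int futex_wait_abs(uint32_t *uaddr, uint32_t expected,
                   const struct timespec *deadline)
{
    return (int)syscall(SYS_futex, uaddr,
                        FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
                        expected, deadline, NULL,
                        FUTEX_BITSET_MATCH_ANY);
}

/* Build an absolute CLOCK_REALTIME deadline 'ms' milliseconds from now. */
struct timespec deadline_in_ms(long ms)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {    /* normalize nanoseconds */
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}
```

By contrast, plain FUTEX_WAIT would take the 50-millisecond interval itself as the timeout rather than a point in time.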

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Documenting ptrace access mode checking

2016-06-24 Thread Michael Kerrisk (man-pages)


Hi Eric,

On 06/23/2016 09:04 PM, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


Hi Eric,

On 06/21/2016 09:55 PM, Eric W. Biederman wrote:

Hmm.

When I gave this level of detail about the user namespace permission
checks you gave me some flak, because it was not particularly
comprehensible to the end users.  I think you deserve the same feedback.

How do we say this in a way that describes a useful way to think
about it?  I read this and I know a lot of what is going on and my
mind goes numb.

How about something like this:

   If the caller's uid and gid are the same as a process's uids and gids
   and the process is configured to allow core dumps (aka it was never
   setuid or setgid) then the caller is allowed to ptrace the process.

   Otherwise the caller must have CAP_SYS_PTRACE.

   Linux security modules impose additional restrictions.

   For consistency, access to various process attributes is guarded with
   the same security checks as the ptrace system call itself, as they are
   all methods to get information about a process.
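
The simplified rule above could be modelled roughly as follows. This is only a sketch of the high-level description, not of the kernel's actual checks. The struct, its fields, and the lsm_allows parameter are hypothetical simplifications.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, heavily simplified view of a process's credentials. */
struct proc_creds {
    uid_t uid;
    gid_t gid;
    bool dumpable;          /* allowed to dump core (never setuid/setgid) */
    bool cap_sys_ptrace;    /* holds CAP_SYS_PTRACE */
};

/*
 * High-level model: access is granted if the IDs match and the target
 * is dumpable, or if the caller has CAP_SYS_PTRACE. An LSM can still
 * veto the result, which is modelled by 'lsm_allows'.
 */
bool may_ptrace(const struct proc_creds *caller,
                const struct proc_creds *target,
                bool lsm_allows)
{
    bool same_ids = caller->uid == target->uid &&
                    caller->gid == target->gid;
    bool base = (same_ids && target->dumpable) || caller->cap_sys_ptrace;

    return base && lsm_allows;
}
```

The real checks distinguish real, effective, saved, and filesystem IDs and take user namespaces into account, which is exactly the detail this model leaves out.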

We certainly need something that gives a high level view so people
reading the man page can know what to expect.   If you get down into the
weeds we run the danger of people beginning to think they can depend
upon bugs in the implementation.


Thanks for the feedback, but I think more detail is required than you
suggest. (And I added all of that detail somewhat reluctantly.)
See my other replies for my rationale.


What I saw badly missing from your description is not the level of
detail but bringing things into a form that ordinary mortals can
understand.

For an explanation to be clear I think we very much need the high level
overview first.  Then we can expand that description with the very
detailed view.

I very much think we need to describe things in such a way that people
understand the principles behind the permission checks, and not just
have the documentation echo the code, so that people can know what weird
things LSMs like yama are likely to do, and how these checks are likely
to evolve in the future.


So, I completely agree with you, and I agree that this could be better.
At first, I understood your meaning to be that I should avoid all of the
detail, and just limit the man page to some very high level text as
you proposed. So, I think it's worth prefixing the details with some
attempt at a high-level picture. How about this as an introductory
paragraph:

   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations)  require  so-called  "ptrace  access mode" checks,
   whose outcome determines whether an operation is permitted (or,
   in  a  few cases, causes a "read" operation to return sanitized
   data).  These checks are performed in cases where  one  process
   can  inspect sensitive information about, or in some cases mod‐
   ify the state of, another process.  The  checks  are  based  on
   factors  such  as  the  credentials and capabilities of the two
   processes, whether or not the "target" process is dumpable, and
   the  results  of checks performed by any enabled Linux Security
   Module (LSM)—for example, SELinux, Yama, or  Smack—and  by  the
   commoncap LSM (which is always invoked).

?


Because one thing is clear to me.  The evolution of these details is
clearly not done, and will continue to change in the future.


Maybe people will even write man page patches when that happens :-).

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Documenting ptrace access mode checking

2016-06-24 Thread Michael Kerrisk (man-pages)


On 06/22/2016 11:11 PM, Kees Cook wrote:

On Wed, Jun 22, 2016 at 12:21 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

On 06/21/2016 10:55 PM, Jann Horn wrote:

On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages)
wrote:

   5.  The  kernel LSM security_ptrace_access_check() interface is
   invoked to see if ptrace access is permitted.  The  results
   depend on the LSM.  The implementation of this interface in
   the default LSM performs the following steps:



For people who are unaware of how the LSM API works, it might be good to
clarify that the commoncap LSM is *always* invoked; otherwise, it might
give the impression that using another LSM would replace it.



As we can see, I am one of those who are unaware of how the LSM API
works :-/.


(Also, are there other documents that refer to it as "default LSM"? I
think that that term is slightly confusing.)



No, that's a terminological confusion of my own making. Fixed now.

I changed this text to:

   Various parts of the kernel-user-space API (not just ptrace(2)
   operations) require so-called "ptrace access mode permissions",
   which are gated by any enabled Linux Security Modules (LSMs),
   such as SELinux, Yama, or Smack, and by the commoncap LSM
   (which is always invoked).  Prior to  Linux  2.6.27,  all  such
   checks  were  of a single type.  Since Linux 2.6.27, two access
   mode levels are distinguished:

BTW, can you point me at the piece(s) of kernel code that show that
"commoncap" is always invoked in addition to any other LSM that has
been installed?


It's not entirely obvious, but the bottom of security/commoncap.c shows:

#ifdef CONFIG_SECURITY

struct security_hook_list capability_hooks[] = {
	LSM_HOOK_INIT(capable, cap_capable),
	...
};

void __init capability_add_hooks(void)
{
	security_add_hooks(capability_hooks, ARRAY_SIZE(capability_hooks));
}

#endif

And security/security.c shows the initialization order of the LSMs:

int __init security_init(void)
{
	pr_info("Security Framework initialized\n");

	/*
	 * Load minor LSMs, with the capability module always first.
	 */
	capability_add_hooks();
	yama_add_hooks();
	loadpin_add_hooks();

	/*
	 * Load all the remaining security modules.
	 */
	do_security_initcalls();

	return 0;
}


So, I just want to check my understanding of a couple of points:

1. The commoncap LSM is invoked first, and if it denies access,
   then no further LSM is/needs to be called.

2. Is it the case that only one of the other LSMs (SELinux, Yama,
   AppArmor, etc.) is invoked, or can more than one be invoked.
   I thought only one is invoked, but perhaps I am out of date
   in my understanding.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: Documenting ptrace access mode checking

2016-06-24 Thread Michael Kerrisk (man-pages)


Stephen,

On 06/23/2016 08:05 PM, Stephen Smalley wrote:

On 06/21/2016 05:41 AM, Michael Kerrisk (man-pages) wrote:

Hi Jann, Stephen, et al.

Jann, since you recently committed a patch in this area, and Stephen,
since you committed 006ebb40d3d much further back in time, I wonder if
you might help me by reviewing the text below that I propose to add to
the ptrace(2) man page, in order to document "ptrace access mode
checking" that is performed in various parts of the kernel-user-space
interface. Of course, I welcome input from anyone else as well.

Here's the new ptrace(2) text. Any comments, technical or terminological
fixes, other improvements, etc. are welcome.

[[
   Ptrace access mode checking
   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations) require so-called "ptrace access mode permissions"
   which are gated  by  Linux  Security  Modules  (LSMs)  such  as
   SELinux,  Yama,  Smack,  or  the  default  LSM.  Prior to Linux
   2.6.27, all such checks were of a  single  type.   Since  Linux
   2.6.27, two access mode levels are distinguished:

   PTRACE_MODE_READ
  For  "read" operations or other operations that are less
  dangerous, such as: get_robust_list(2); kcmp(2); reading
  /proc/[pid]/auxv, /proc/[pid]/environ, or
  /proc/[pid]/stat; or readlink(2) of  a  /proc/[pid]/ns/*
  file.

   PTRACE_MODE_ATTACH
  For  "write"  operations,  or  other operations that are
  more dangerous, such as: ptrace attaching
  (PTRACE_ATTACH) to another process or calling
  process_vm_writev(2).   (PTRACE_MODE_ATTACH  was  effec‐
  tively the default before Linux 2.6.27.)


That was the intent when the distinction was introduced, but it doesn't
appear to have been properly maintained, e.g. there is now a common
helper lock_trace() that is used for
/proc/pid/{stack,syscall,personality} but checks PTRACE_MODE_ATTACH, and
PTRACE_MODE_ATTACH is also used in timerslack_ns_write/show().  Likely
should review and make them consistent.  There was also some debate
about proper handling of /proc/pid/fd.  Arguably that one might belong
back in the _ATTACH camp.


Thanks for the background info.


   Since  Linux  4.5, the above access mode checks may be combined
   (ORed) with one of the following modifiers:

   PTRACE_MODE_FSCREDS
  Use the caller's filesystem UID and GID (see
  credentials(7)) or effective capabilities for LSM checks.

   PTRACE_MODE_REALCREDS
  Use the caller's real UID and GID or permitted
  capabilities for LSM checks.  This was effectively the
  default before Linux 4.5.

   Because  combining  one of the credential modifiers with one of
   the aforementioned access modes is  typical,  some  macros  are
   defined in the kernel sources for the combinations:

   PTRACE_MODE_READ_FSCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_READ_REALCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

   PTRACE_MODE_ATTACH_FSCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_ATTACH_REALCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

   One further modifier can be ORed with the access mode:

   PTRACE_MODE_NOAUDIT (since Linux 3.3)
  Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]


Some ptrace access mode checks, such as checks when reading
/proc/pid/stat, merely cause the output to be filtered/sanitized rather
than an error to be returned to the caller.  In these cases, accessing
the file is not a security violation and there is no reason to generate
a security audit record.  This modifier suppresses the generation of
such an audit record for the particular access check.
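
For readers following the draft text above, the way the access modes and modifiers combine can be sketched in C. This is only an illustration: the numeric values are assumed to mirror include/linux/ptrace.h as of Linux 4.5 and should be checked against the kernel sources, and mode_has_one_creds_modifier() is a hypothetical helper, not a kernel function. It models the rule in the draft that an access mode is combined with exactly one of the two credential modifiers.

```c
#include <stdbool.h>

/* Illustrative values, assumed to match include/linux/ptrace.h (Linux 4.5). */
#define PTRACE_MODE_READ       0x01
#define PTRACE_MODE_ATTACH     0x02
#define PTRACE_MODE_NOAUDIT    0x04
#define PTRACE_MODE_FSCREDS    0x08
#define PTRACE_MODE_REALCREDS  0x10

/* Shorthand combinations, as listed in the draft man-page text. */
#define PTRACE_MODE_READ_FSCREDS     (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_READ_REALCREDS   (PTRACE_MODE_READ | PTRACE_MODE_REALCREDS)
#define PTRACE_MODE_ATTACH_FSCREDS   (PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_ATTACH_REALCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS)

/*
 * Hypothetical helper: true if exactly one of the two credential
 * modifiers is present in 'mode'. Other bits (access mode, NOAUDIT)
 * are ignored.
 */
static bool mode_has_one_creds_modifier(unsigned int mode)
{
    bool fs = (mode & PTRACE_MODE_FSCREDS) != 0;
    bool real = (mode & PTRACE_MODE_REALCREDS) != 0;

    return fs != real;
}
```

A check that only filters output, such as the /proc/pid/stat case described above, would then pass something like PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT.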


Thanks, I've added that text to the man page more or less as you
gave it here.

Cheers,

Michael




Re: Documenting ptrace access mode checking

2016-06-24 Thread Michael Kerrisk (man-pages)


On 06/23/2016 08:56 PM, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


Hi Oleg,

On 06/22/2016 11:51 PM, Oleg Nesterov wrote:

On 06/21, Eric W. Biederman wrote:


Adding Oleg just because he seems to do most of the ptrace related
maintenance these days.


so I have to admit that I never even tried to actually understand
ptrace_may_access ;)


We certainly need something that gives a high level view so people
reading the man page can know what to expect.   If you get down into the
weeds we run the danger of people beginning to think they can depend
upon bugs in the implementation.


Personally I agree. I think "man ptrace" shouldn't tell too much
about kernel internals.


See my other replies on this topic. Somehow, we need a way of
describing the behavior that user-space sees. I think it's
inevitable that that means talking about what's going on
"under the hood".

Regarding Eric's point that "we run the danger of people beginning
to think they can depend upon bugs in the implementation": when it
comes to breaking the ABI, the presence or absence of documentation
doesn't save us on that point (Linus has a few times made his position
wrt documentation clear).


Which is interesting in this respect, as a bug in the implementation
that is a security issue can and will be changed, even if userspace
breaks.  Breaking userspace is not desirable but when there is no other
reasonable choice it will happen.


Yes, good point.

Cheers,

Michael



Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-24 Thread Michael Kerrisk (man-pages)


On 06/23/2016 09:53 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 08:35:15PM +0200, Michael Kerrisk (man-pages) wrote:

Hi Darren,

On 06/23/2016 06:16 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:

On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:

On 06/23/2016 09:18 AM, Thomas Gleixner wrote:
Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner <t...@linutronix.de> wrote:

On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:

And that universe would love to have your documentation of
FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),


I give you almost the full treatment, but I leave REQUEUE_PI to Darren
and FUTEX_WAKE_OP to Jakub. :)
[...]
FUTEX_CLOCK_REALTIME

This option bit can be ored on the futex ops FUTEX_WAIT_BITSET
and FUTEX_WAIT_REQUEUE_PI

If set the kernel treats the user space supplied timeout as
absolute time based on CLOCK_REALTIME.

If not set the kernel treats the user space supplied timeout
as relative time.

Unfortunately, I should have checked the code more carefully...


Me too :)


Seems to be going around...




Looking more carefully at the code, I understand the situation
is the following:

FUTEX_LOCK_PI
Always uses CLOCK_REALTIME
'timeout' is absolute


Yes.


FUTEX_WAIT_REQUEUE_PI
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute


Yes


FUTEX_WAIT_BITSET
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute


Yes


FUTEX_WAIT
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is relative


Yes.


I've amended the man page to describe those details.


OK, that answers my question: timeout interpretation as relative or absolute is
based on the op code, not the CLOCK flag.
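
The four-way summary above can be written down as a small user-space sketch. It encodes only what was agreed in this exchange (clock choice and relative/absolute interpretation per op code), not the behavior of any particular kernel version. The two helper functions are hypothetical names invented for illustration; the FUTEX_* constants and FUTEX_CMD_MASK come from <linux/futex.h>. The helpers are only meaningful for the four ops that take a timeout.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <linux/futex.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: which clock measures the timeout for this
 * futex operation? FUTEX_LOCK_PI always uses CLOCK_REALTIME; the
 * other waiting ops choose by the FUTEX_CLOCK_REALTIME flag.
 */
static clockid_t futex_timeout_clock(int op)
{
    int cmd = op & FUTEX_CMD_MASK;   /* strip PRIVATE and CLOCK flags */

    if (cmd == FUTEX_LOCK_PI)
        return CLOCK_REALTIME;
    return (op & FUTEX_CLOCK_REALTIME) ? CLOCK_REALTIME : CLOCK_MONOTONIC;
}

/*
 * Hypothetical helper: is the timeout absolute? Per the summary, this
 * depends only on the op code: relative for FUTEX_WAIT, absolute for
 * FUTEX_LOCK_PI, FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI.
 */
static bool futex_timeout_is_absolute(int op)
{
    return (op & FUTEX_CMD_MASK) != FUTEX_WAIT;
}
```

A caller preparing a FUTEX_WAIT_BITSET timeout would read the clock returned by futex_timeout_clock() and add its interval to that reading, whereas a FUTEX_WAIT caller passes the interval itself.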




The flag was explicitly added to allow FUTEX_WAIT to hand in absolute time.


When you say that the "flag was added", which flag do you mean? Or, did you
mean: "applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute
time".


I didn't express myself clearly. When Darren added the support for
CLOCK_REALTIME to FUTEX_WAIT I think he wanted to add absolute timeout
support. Anything else does not make sense.


I sent that patch because reading the new man page it struck me as strange that
FUTEX_WAIT was restricted to CLOCK_MONOTONIC and the other op codes were not,
especially since FUTEX_WAIT is just FUTEX_WAIT_BITSET with the mask set to
ALL.

I didn't realize the impact to relative/absolute interpretation of the timeout
value at the time.

I think it was a mistake to introduce a change that made FUTEX_WAIT interpret
the timeout differently based on the CLOCK flag,


I'm missing something. Where does it do that? As far as I can tell, FUTEX_WAIT
always interprets the timeout as relative, regardless of presence/absence of
FUTEX_CLOCK_REALTIME. Am I missing something?


No you're not. The code as it stands today is always relative, but it gets the
base time from the wrong clock source in the case of FUTEX_CLOCK_REALTIME.


Ahh yes, I'd clicked to that, but forgot to say so.


I was stating that I think it would be a mistake to add absolute timeout to
FUTEX_WAIT based on the FUTEX_CLOCK_REALTIME flag, which is how Thomas describes
above his interpretation of my earlier change.


Got it now. Thanks for the clarification, Darren.

Cheers

Michael




Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-23 Thread Michael Kerrisk (man-pages)


On 06/23/2016 08:28 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 07:26:52PM +0200, Thomas Gleixner wrote:

On Thu, 23 Jun 2016, Darren Hart wrote:

On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:
In my opinion, we should treat the timeout value as relative for FUTEX_WAIT
regardless of the CLOCK used.


Which requires even more changes as you have to select which clock you are
using for adding the base time.


Right, something like the following?


diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..c39d807 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,8 +3230,12 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
-			t = ktime_add_safe(ktime_get(), t);
+		if (cmd == FUTEX_WAIT) {
+			if (cmd & FUTEX_CLOCK_REALTIME)
+				t = ktime_add_safe(ktime_get_real(), t);
+			else
+				t = ktime_add_safe(ktime_get(), t);
+		}
 		tp = &t;
 	}
 	/*


Just in the interests of readability/maintainability, might it not
make some sense to recode the timeout handling for FUTEX_WAIT
within futex_wait(). I think that part of the reason we're in this
mess of inconsistency is that timeout interpretation is being handled
at too many different points in the code.
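
As a rough user-space picture of what handling this in one place could look like, the sketch below picks the clock and converts a relative timeout into an absolute deadline in a single helper. It is not kernel code: timespec_add() and relative_to_deadline() are hypothetical names, clock_gettime() stands in for ktime_get()/ktime_get_real(), and unlike ktime_add_safe() the addition here does not guard against tv_sec overflow.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Add two timespecs, normalising tv_nsec into [0, NSEC_PER_SEC).
 * Both inputs are assumed to be normalised already; no overflow
 * check is made on tv_sec.
 */
static struct timespec timespec_add(struct timespec a, struct timespec b)
{
    struct timespec sum;

    sum.tv_sec = a.tv_sec + b.tv_sec;
    sum.tv_nsec = a.tv_nsec + b.tv_nsec;
    if (sum.tv_nsec >= NSEC_PER_SEC) {
        sum.tv_sec += 1;
        sum.tv_nsec -= NSEC_PER_SEC;
    }
    return sum;
}

/*
 * Hypothetical helper: turn a relative timeout into an absolute
 * deadline on the selected clock. Returns 0 on success, or -1 if
 * the clock could not be read.
 */
static int relative_to_deadline(int use_realtime, const struct timespec *rel,
                                struct timespec *deadline)
{
    struct timespec now;
    clockid_t clk = use_realtime ? CLOCK_REALTIME : CLOCK_MONOTONIC;

    if (clock_gettime(clk, &now) == -1)
        return -1;
    *deadline = timespec_add(now, *rel);
    return 0;
}
```

Keeping the clock selection and the addition together means the base time is always read from the same clock that the deadline is later compared against.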


And as a follow-on, what is the reason for FUTEX_LOCK_PI only using
CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a
specific amount of time, regardless of wall time.


Yes, that's another weird inconsistency.

Thanks,

Michael



Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-23 Thread Michael Kerrisk (man-pages)

Hi Darren,

On 06/23/2016 06:16 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:

On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:

On 06/23/2016 09:18 AM, Thomas Gleixner wrote:
Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner <t...@linutronix.de> wrote:

On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:

And that universe would love to have your documentation of
FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),

I give you almost the full treatment, but I leave REQUEUE_PI to Darren
and FUTEX_WAKE_OP to Jakub. :)
[...]
FUTEX_CLOCK_REALTIME

This option bit can be ored on the futex ops FUTEX_WAIT_BITSET
and FUTEX_WAIT_REQUEUE_PI

If set the kernel treats the user space supplied timeout as
absolute time based on CLOCK_REALTIME.

If not set the kernel treats the user space supplied timeout
as relative time.

Unfortunately, I should have checked the code more carefully...

Me too :)

Seems to be going around...

Looking more carefully at the code, I understand the situation
is the following:

FUTEX_LOCK_PI
Always uses CLOCK_REALTIME
'timeout' is absolute

Yes.

FUTEX_WAIT_REQUEUE_PI
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute

Yes

FUTEX_WAIT_BITSET
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute

Yes

FUTEX_WAIT
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is relative

Yes.

I've amended the man page to describe those details.

OK, that answers my question: timeout interpretation as relative or absolute is
based on the op code, not the CLOCK flag.

The flag was explicitly added to allow FUTEX_WAIT to hand in absolute time.

When you say that the "flag was added", which flag do you mean? Or, did you
mean: "applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute
time".

I didn't express myself clearly. When Darren added the support for
CLOCK_REALTIME to FUTEX_WAIT I think he wanted to add absolute timeout
support. Anything else does not make sense.

I sent that patch because reading the new man page it struck me as strange that
FUTEX_WAIT was restricted to CLOCK_MONOTONIC and the other op codes were not,
especially since FUTEX_WAIT is just FUTEX_WAIT_BITSET with the mask set to
ALL.

I didn't realize the impact to relative/absolute interpretation of the timeout
value at the time.

I think it was a mistake to introduce a change that made FUTEX_WAIT interpret
the timeout differently based on the CLOCK flag,

I'm missing something. Where does it do that? As far as I can tell, FUTEX_WAIT
always interprets the timeout as relative, regardless of presence/absence of
FUTEX_CLOCK_REALTIME. Am I missing something?

while that interpretation is
independent of the CLOCK flag for all other op codes.

In my opinion, we should treat the timeout value as relative for FUTEX_WAIT
regardless of the CLOCK used.

I realize it's historical, but it is really weird that FUTEX_WAIT interprets
the timeout (relative vs absolute) differently from all of the other
operations.

That would require a change to the man page to eliminate the relative/absolute
language in the FUTEX_CLOCK_REALTIME definition and explicit definitions of the
interpretation for each op code (as Matthew explains above).

Do we agree on that?

Yes.

The man page changes are already in Git. My earlier reply contained the
commit ref:
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/commit/?id=8064bfa5369c6856f606004d02e48ab275e05bed

Cheers,

Michael


Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-23 Thread Michael Kerrisk (man-pages)

Hi Darren,

On 06/23/2016 06:16 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:

On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:

On 06/23/2016 09:18 AM, Thomas Gleixner wrote:
Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner wrote: