Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)

On 07/26/2016 04:54 AM, Andrew Vagin wrote:

On Mon, Jul 25, 2016 at 09:59:43AM -0500, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


[snip]


[snip]

So, from my point of view, the important piece that was missing from
your commit message was the note to use readlink("/proc/self/fd/%d")
on the returned FDs. I think that detail needs to be part of the
commit message (and also the man page text). I think it would even be
helpful to include the above program as part of the commit message:
it helps people more quickly grasp the API.


Please, please make the standard way to compare these things fstat.
That is much less magic than a symlink, and a little more future proof.
Possibly even kcmp.


I like the idea of using kcmp to compare namespaces. I am going to add this
functionality to kcmp and describe all of this in the man page.


Hi Andrey,

Can you briefly sketch out the proposed API and how it would be used?
I'd find it useful to see that even before the implementation.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

2016-07-25 Thread Michael Kerrisk (man-pages)

Hi Eric,

On 07/25/2016 03:18 PM, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


Hi Andrey,

On 07/22/2016 08:25 PM, Andrey Vagin wrote:

On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

Hi Andrey,


On 07/21/2016 11:06 PM, Andrew Vagin wrote:


On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
wrote:


Hi Andrey,

On 07/14/2016 08:20 PM, Andrey Vagin wrote:







Could you add here an explanation of the API in detail: what do these FDs refer to,
and how do you use them to solve the use case? And could you add
that info to the commit messages please.



Hi Michael,

A patch for man-pages is attached. It adds the following text to
namespaces(7).

Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors.  The correct syntax is:

  fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
  Returns a file descriptor that refers to an owning user
  namespace.

NS_GET_PARENT
  Returns a file descriptor that refers to a parent namespace.
  This ioctl(2) can be used for pid and user namespaces. For user
  namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
  meaning.


For each of the above, I think it is worth mentioning that the
close-on-exec flag is set for the returned file descriptor.


Hmm.  That is an odd default.


Why do you say that? It's pretty common as the default for various
APIs that create new FDs these days. (There's of course a strong argument
that the original UNIX default was a design blunder...)



In addition to generic ioctl(2) errors, the following specific ones can
occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The requested namespace is outside of the current namespace
  scope.


Perhaps add "and the caller does not have CAP_SYS_ADMIN in the initial
user namespace"?


Having looked at that bit of code I don't think capabilities really
have a role to play.


Yes, I caught up with that now. I'll wait to see how this plays out
in the next patch version.


ENOENT ns_fd refers to the init namespace.



Thanks for this. But still part of the question remains unanswered.
How do we (in user-space) use the file descriptors to answer any of
the questions that this patch series was designed to solve? (This
info should be in the commit message and the man-pages patch.)


I'm sorry, but I am not sure that I understand what you ask.

Here are the original questions:
Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespace types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations
on a resource governed by namespace Y?

Here is an example which shows how we can get the owning namespace
inode number by using these ioctl-s.

$ ls -l /proc/13929/ns/pid
lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'

$ ./nsowner /proc/13929/ns/pid
user:[4026532227]

The owning user namespace for pid:[4026532228] is user:[4026532227].

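The nsowner tool is compiled from this code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nsfs.h>   /* defines NS_GET_USERNS (Linux 4.9 and later) */

int main(int argc, char *argv[])
{
    /* The path template leaves room for the largest possible fd number. */
    char buf[128], path[] = "/proc/self/fd/0123456789";
    int ns, uns, ret;

    if (argc < 2)
        return 1;

    /* Open the namespace file given on the command line. */
    ns = open(argv[1], O_RDONLY);
    if (ns < 0)
        return 1;

    /* Ask for a descriptor referring to the owning user namespace. */
    uns = ioctl(ns, NS_GET_USERNS);
    if (uns < 0)
        return 1;

    /* The symlink target names the namespace, e.g. "user:[4026532227]". */
    snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
    ret = readlink(path, buf, sizeof(buf) - 1);
    if (ret < 0)
        return 1;
    buf[ret] = 0;

    printf("%s\n", buf);

    return 0;
}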


So, from my point of view, the important piece that was missing from
your commit message was the note to use readlink("/proc/self/fd/%d")
on the returned FDs. I think that detail needs to be part of the
commit message (and also the man page text). I think it would even be
helpful to include the above program as part of the commit message:
it helps people more quickly grasp the API.


Please, please make the standard way to compare these things fstat.
That is much less magic than a symlink, and a little more future proof.
Possibly even kcmp.


As in fstat() to get the st_ino field, right?

Cheers,

Michael


At some point we will care about migrating a sub-container and we
may have to make some minor changes.

Eric




--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

2016-07-25 Thread Michael Kerrisk (man-pages)

Hi Andrey,

On 07/22/2016 08:25 PM, Andrey Vagin wrote:

On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

Hi Andrey,


On 07/21/2016 11:06 PM, Andrew Vagin wrote:


On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
wrote:


Hi Andrey,

On 07/14/2016 08:20 PM, Andrey Vagin wrote:







Could you add here an explanation of the API in detail: what do these FDs refer to,
and how do you use them to solve the use case? And could you add
that info to the commit messages please.



Hi Michael,

A patch for man-pages is attached. It adds the following text to
namespaces(7).

Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors.  The correct syntax is:

  fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
  Returns a file descriptor that refers to an owning user
  namespace.

NS_GET_PARENT
  Returns a file descriptor that refers to a parent namespace.
  This ioctl(2) can be used for pid and user namespaces. For user
  namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
  meaning.


For each of the above, I think it is worth mentioning that the
close-on-exec flag is set for the returned file descriptor.



In addition to generic ioctl(2) errors, the following specific ones can
occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The requested namespace is outside of the current namespace
  scope.


Perhaps add "and the caller does not have CAP_SYS_ADMIN in the initial
user namespace"?



ENOENT ns_fd refers to the init namespace.



Thanks for this. But still part of the question remains unanswered.
How do we (in user-space) use the file descriptors to answer any of
the questions that this patch series was designed to solve? (This
info should be in the commit message and the man-pages patch.)


I'm sorry, but I am not sure that I understand what you ask.

Here are the original questions:
Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespace types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations
on a resource governed by namespace Y?

Here is an example which shows how we can get the owning namespace
inode number by using these ioctl-s.

$ ls -l /proc/13929/ns/pid
lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'

$ ./nsowner /proc/13929/ns/pid
user:[4026532227]

The owning user namespace for pid:[4026532228] is user:[4026532227].

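The nsowner tool is compiled from this code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nsfs.h>   /* defines NS_GET_USERNS (Linux 4.9 and later) */

int main(int argc, char *argv[])
{
    /* The path template leaves room for the largest possible fd number. */
    char buf[128], path[] = "/proc/self/fd/0123456789";
    int ns, uns, ret;

    if (argc < 2)
        return 1;

    /* Open the namespace file given on the command line. */
    ns = open(argv[1], O_RDONLY);
    if (ns < 0)
        return 1;

    /* Ask for a descriptor referring to the owning user namespace. */
    uns = ioctl(ns, NS_GET_USERNS);
    if (uns < 0)
        return 1;

    /* The symlink target names the namespace, e.g. "user:[4026532227]". */
    snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
    ret = readlink(path, buf, sizeof(buf) - 1);
    if (ret < 0)
        return 1;
    buf[ret] = 0;

    printf("%s\n", buf);

    return 0;
}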


So, from my point of view, the important piece that was missing from
your commit message was the note to use readlink("/proc/self/fd/%d")
on the returned FDs. I think that detail needs to be part of the
commit message (and also the man page text). I think it would even be
helpful to include the above program as part of the commit message:
it helps people more quickly grasp the API.


Does this example answer the original question?


Yes.


If it doesn't, could
you elaborate on what you expect to see here?

And I wrote one more example which shows all relationships between
namespaces. It enumerates all processes in a system, collects all
namespaces and determines parent and owning namespaces for each of
them, then it constructs a namespace tree and shows it.

Here is the code: https://gist.github.com/avagin/db805f95e15ffb0af7e559dbb8de4418


That's great! Thanks!
 

Here is an example of output for my test system:
[root@fc24 nsfs]# ./nstree
user:[4026531837]
 \__  mnt:[4026532203]
 \__  ipc:[4026531839]
 \__  user:[4026532224]
 \__  user:[4026532226]
 \__  user:[4026532227]
 \__  pid:[4026532228]
 \__  pid:[4026532225]
 \__  pid:[4026532228]
 \__  user:[4026532221]
 \__  pid:[402653]
 \__  user:[4026532223]
 \__  mnt:[4026532211]
 \__  uts:[4026531838]
 \__  cgroup:[4026531835]
 \__  pid:[4026531836]
 \__  pid:[4026532225]
 \__  pid:[4026532228]
 \__  pid:[402653]
 \__  mnt:[4026531857]
 \__  mnt:[4026531840]
 \__  net:[4026531957]


Cheers,

Michael


[1] 

Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

2016-07-21 Thread Michael Kerrisk (man-pages)

Hi Andrey,

On 07/14/2016 08:20 PM, Andrey Vagin wrote:

Each namespace has an owning user namespace, and currently there is no way
to discover these relationships.

Pid and user namespaces are hierarchical. There is no way to discover
parent-child relationships either.

Why might we want to know the relationships between namespaces?

One use would be visualization, in order to understand the running system.
Another would be to answer the question: what capability does process X have to
perform operations on a resource governed by namespace Y?

One more use case (which is usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determine
relationships between namespaces.

Eric suggested adding two ioctl-s [2]:

Grumble, Grumble.  I think this may actually be a case for creating ioctls
for these two cases.  Now that random nsfs file descriptors are bind
mountable the original reason for using proc files is not as pressing.

One ioctl for the user namespace that owns a file descriptor.
One ioctl for the parent namespace of a namespace file descriptor.


Here is an implementation of these ioctl-s.


Could you add here an explanation of the API in detail: what do these FDs refer to,
and how do you use them to solve the use case? And could you add
that info to the commit messages please.

Thanks,

Michael



[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Cc: "Eric W. Biederman" <ebied...@xmission.com>
Cc: James Bottomley <james.bottom...@hansenpartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com>
Cc: "W. Trevor King" <wk...@tremily.us>
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hal...@canonical.com>

--
2.5.5





--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


man-pages-4.07 is released

2016-07-18 Thread Michael Kerrisk (man-pages)



Gidday,

The Linux man-pages maintainer proudly announces:

man-pages-4.07 - man pages for Linux

This release includes input and contributions from
around 50 people. Over 140 pages saw changes, ranging
from typo fixes through to page rewrites and 4 newly
created pages.

Tarball download:
http://www.kernel.org/doc/man-pages/download.html
Git repository:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
http://man7.org/linux/man-pages/changelog.html#release_4.07

A short summary of the release is blogged at:
http://linux-man-pages.blogspot.com/2016/07/man-pages-407-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers on LKML is shown below.

Cheers,

Michael

 Changes in man-pages-4.07 

Released: 2016-07-17, Ulm


New and rewritten pages
-----------------------

ioctl_fideduperange.2
    Darrick J. Wong  [Christoph Hellwig, Michael Kerrisk]
        New page documenting the FIDEDUPERANGE ioctl
            Document the FIDEDUPERANGE ioctl, formerly known as
            BTRFS_IOC_EXTENT_SAME.

ioctl_ficlonerange.2
    Darrick J. Wong  [Christoph Hellwig, Michael Kerrisk]
        New page documenting FICLONE and FICLONERANGE ioctls
            Document the FICLONE and FICLONERANGE ioctls, formerly known as
            the BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls.

mount_namespaces.7
    Michael Kerrisk  [Michael Kerrisk]
        New page describing mount namespaces


Newly documented interfaces in existing pages
---------------------------------------------

mount.2
    Michael Kerrisk
        Document flags used to set propagation type
            Document MS_SHARED, MS_PRIVATE, MS_SLAVE, and MS_UNBINDABLE.
    Michael Kerrisk
        Document the MS_REC flag

ptrace.2
    Michael Kerrisk  [Kees Cook, Jann Horn, Eric W. Biederman, Stephen Smalley]
        Document ptrace access modes

proc.5
    Michael Kerrisk
        Document /proc/[pid]/timerslack_ns
    Michael Kerrisk
        Document /proc/PID/status 'Ngid' field
    Michael Kerrisk
        Document /proc/PID/status fields: 'NStgid', 'NSpid', 'NSpgid', 'NSsid'
    Michael Kerrisk
        Document /proc/PID/status 'Umask' field


Changes to individual pages
---------------------------

ldd.1
    Michael Kerrisk
        Add a little more detail on why ldd is unsafe with untrusted executables

futex.2
    Michael Kerrisk
        Correct an ENOSYS error description
            Since Linux 4.5, FUTEX_CLOCK_REALTIME is allowed with FUTEX_WAIT.
    Michael Kerrisk  [Darren Hart]
        Remove crufty text about FUTEX_WAIT_BITSET interpretation of timeout
            Since Linux 4.5, FUTEX_WAIT also understands
            FUTEX_CLOCK_REALTIME.
    Michael Kerrisk  [Thomas Gleixner]
        Explain how to get equivalent of FUTEX_WAIT with an absolute timeout
    Michael Kerrisk
        Describe FUTEX_BITSET_MATCH_ANY
            Describe FUTEX_BITSET_MATCH_ANY and FUTEX_WAIT and FUTEX_WAKE
            equivalences.
    Michael Kerrisk  [Thomas Gleixner, Darren Hart]
        Fix descriptions of various timeouts
    Michael Kerrisk
        Clarify clock default and choices for FUTEX_WAIT

kcmp.2
    Michael Kerrisk
        kcmp() is governed by PTRACE_MODE_READ_REALCREDS

mount.2
    Michael Kerrisk
        Restructure discussion of 'mountflags' into functional groups
            The existing text makes no differentiation between different
            "classes" of mount flags. However, certain flags such as
            MS_REMOUNT, MS_BIND, MS_MOVE, etc. determine the general
            type of operation that mount() performs. Furthermore, the
            choice of which class of operation to perform is performed in
            a certain order, and that order is significant if multiple
            flags are specified. Restructure and extend the text to
            reflect these details.
    Michael Kerrisk
        Since Linux 2.6.26, bind mounts can be made read-only

process_vm_readv.2
    Michael Kerrisk
        Rephrase permission rules in terms of a ptrace access mode check

ptrace.2
    Michael Kerrisk  [Jann Horn]
        Update Yama ptrace_scope documentation
            Reframe the discussion in terms of PTRACE_MODE_ATTACH checks,
            and make a few other minor tweaks and additions.
    Michael Kerrisk, Jann Horn
        Note that user namespaces can be used to bypass Yama protections
    Michael Kerrisk
        Note that PTRACE_SEIZE is subject to a ptrace access mode check
    Michael Kerrisk
        Rephrase PTRACE_ATTACH permissions in terms of ptrace access mode check

wait.2
    Michael Kerrisk
        Since Linux 4.7, __WALL is implied if child being ptraced
    Michael Kerrisk
        waitid() now (since Linux 4.7) also supports __WNOTHREAD/__WCLONE/__WALL

proc.5
    Michael Kerrisk
        /proc/PID/fd/* ar

Re: Bugzilla spam

2016-07-13 Thread Michael Kerrisk (man-pages)
Hello Konstantin,

On 13 July 2016 at 20:37, Konstantin Ryabitsev <mri...@kernel.org> wrote:
> On Wed, Jul 13, 2016 at 08:28:18PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Konstantin,
>>
>> The man-pages Bugzilla component (as well as other components on
>> Bugzilla by the look of things) is receiving vast quantities of spam.
>> What can be done about this? (Just marking the bugs private and
>> closing isn't workable. There's just too many bugs coming in...).
>
> Not much can be done. :( Bugzilla's default spam-fighting capabilities
> are abysmal -- I can't even delete any accounts without installing
> multiple extensions. I'm actively investigating what we can do to
> improve the situation and will follow up shortly.

Okay, thanks. In the meantime, is it possible for you to lock the
man-pages component so that no further bug reports can be made via
that component?

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Bugzilla spam

2016-07-13 Thread Michael Kerrisk (man-pages)
Hello Konstantin,

The man-pages Bugzilla component (as well as other components on
Bugzilla by the look of things) is receiving vast quantities of spam.
What can be done about this? (Just marking the bugs private and
closing isn't workable. There's just too many bugs coming in...).

Thanks

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Michael Kerrisk (man-pages)

On 07/08/2016 05:26 AM, James Bottomley wrote:

On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:

On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:

On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:

On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:

On 7 July 2016 at 17:01, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm.  Probably best-effort based on the process hierarchy.  So
yeah you could probably get a tree into a state that would be
wrongly recreated. Create a new netns, bind mount it, exit;  Have
another task create a new user_ns, bind mount it, exit;  Third
task setns()s first to the new netns then to the new user_ns.  I
suspect criu will recreate that wrongly.


This is a bit pathological, and you have to be root to do it: so
root can set up a nesting hierarchy, bind it and destroy the pids
but I know of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I
have a permanent handle to enter the namespace by, so I suspect
that when our current orchestration systems get more sophisticated,
they might eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the
problem is that although each namespace has a parent user_ns,
there's no way to get it without digging in the namespace specific
structure.  Probably we should restructure to move it into
ns_common, then we could display it (and enforce all namespaces
having owning user_ns) but it would be a

I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?


Um, yes, I don't believe I said they don't.  The problem I thought you
were having is that there's no way of seeing what it is.

nsfs is the Namespace filesystem where bound namespaces appear to a
cat of /proc/self/mounts.  It can display any information that's in
ns_common (the common core of namespaces) but the owning user_ns
pointer currently isn't in this structure.  Every user namespace has
a pointer to it, but they're all privately embedded in the individual
namespace specific structures.  What I was proposing was that since
every current namespace has a pointer somewhere to the owning user
namespace, we could abstract this out into ns_common so it's now
accessible to be displayed by nsfs, probably as a mount option.


James, I am not sure that I understood you correctly. We have one
file system for all namespace files, how we can show per-file
properties in mount options. I think we can show all required
information in fdinfo. We open a namespaces file (/proc/pid/ns/N)
and then read /proc/pid/fdinfo/X for it.


Here is a proof-of-concept patch.

How it works:

In [1]: import os

In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)

In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
pos:0
flags:  010
mnt_id: 2
userns: 4026531837

In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
/proc/self/ns/user -> user:[4026531837]
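
The session above can be restated as a small Python 3 sketch that turns the fdinfo text into a dictionary. Note the assumptions: the `userns` field exists only with the proof-of-concept patch applied, so on an unpatched kernel the lookup simply finds nothing; the helper names `parse_fdinfo` and `read_fdinfo` are illustrative and not part of any proposed interface.

```python
import os


def parse_fdinfo(text):
    """Parse the 'key: value' lines of a /proc/PID/fdinfo/FD file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            fields[key.strip()] = value.strip()
    return fields


def read_fdinfo(fd, pid="self"):
    """Read and parse the fdinfo entry for an open file descriptor."""
    with open("/proc/%s/fdinfo/%d" % (pid, fd)) as f:
        return parse_fdinfo(f.read())


if __name__ == "__main__":
    try:
        fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
    except OSError:
        fd = None  # no /proc or no namespace files on this system
    if fd is not None:
        try:
            info = read_fdinfo(fd)
        finally:
            os.close(fd)
        # 'userns' is present only on a kernel carrying the proof-of-concept patch.
        print(info.get("userns", "no userns field in fdinfo"))
```

Keeping the parser separate from the file access means it can be exercised on captured text such as the output shown above.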


can't you just do

readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'

?

But what Michael was asking about was the parent user_ns of all the
other namespaces ...
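
For comparison with the fdinfo approach, the readlink-plus-sed pipeline above can be written as a short Python 3 sketch. The function names are illustrative assumptions; the only format relied on is the `type:[inode]` link text visible in the output earlier in this message.

```python
import os
import re

# Link targets under /proc/PID/ns/ look like "user:[4026531837]".
_NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % (target,))
    return match.group("type"), int(match.group("inode"))


def ns_inode(pid="self", ns="user"):
    """Return the inode number of one of a process's namespaces."""
    return parse_ns_link(os.readlink("/proc/%s/ns/%s" % (pid, ns)))[1]
```

As the message notes, this only identifies the namespace a process is in; it says nothing about which user namespace owns some other namespace.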


Just to reiterate, what I'm interested in is the introspection use
case (but there's clearly several other interesting use cases here).
The idea is to be able to answer these questions

1. For each userns, what is the parent of that userns?

2. For each non-user namespace, what is the owning userns?

This enables us to understand the userns hierarchy, which
matters in terms of answering the question: what capabilities
does process X have in namespace Y?
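
To illustrate why both relationships are needed, here is a minimal Python sketch of the capability question, assuming the two mappings were somehow available. Everything in it is hypothetical: the dictionaries `owner_of` and `parent_of` stand in for information that no interface currently exposes, and the rule is simplified to "a process that has a capability in a user namespace also has it in all descendant user namespaces", ignoring the namespace-owner UID rule and the process's actual capability sets.

```python
def userns_ancestors(userns, parent_of):
    """Yield userns, then its parent, grandparent, ... up to the root."""
    seen = set()
    while userns is not None and userns not in seen:
        seen.add(userns)
        yield userns
        userns = parent_of.get(userns)


def may_have_capabilities(process_userns, target_ns, owner_of, parent_of):
    """Can a process in process_userns hold capabilities over target_ns?

    True when the user namespace owning target_ns is the process's own
    user namespace or one of its descendants (simplified rule).
    """
    owning_userns = owner_of[target_ns]
    return process_userns in userns_ancestors(owning_userns, parent_of)


# Hypothetical identifiers, for illustration only.
INIT, CHILD, GRANDCHILD = "init_user_ns", "userns_A", "userns_B"
parent_of = {INIT: None, CHILD: INIT, GRANDCHILD: CHILD}   # question 1
owner_of = {"netns_1": INIT, "netns_2": CHILD}             # question 2

print(may_have_capabilities(INIT, "netns_2", owner_of, parent_of))
print(may_have_capabilities(GRANDCHILD, "netns_2", owner_of, parent_of))
```

Without the answer to question 2 the lookup in `owner_of` is impossible, and without the answer to question 1 the ancestor walk is impossible.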
   
Cheers,


Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Michael Kerrisk (man-pages)

On 07/08/2016 05:26 AM, James Bottomley wrote:

On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:

On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:

On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:

On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages)
wrote:

On 7 July 2016 at 17:01, James Bottomley
 wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm.  Probably best-effort based on the process hierarchy.
 So
yeah you could probably get a tree into a state that would
be
wrongly recreated. Create a new netns, bind mount it, exit;
  Have
another task create a new user_ns, bind mount it, exit;
 Third
task setns()s first to the new netns then to the new
user_ns.  I
suspect criu will recreate that wrongly.


This is a bit pathological, and you have to be root to do it:
so
root can set up a nesting hierarchy, bind it and destroy the
pids
but I know of no current orchestration system which does
this.

Actually, I have to back pedal a bit: the way I currently set
up
architecture emulation containers does precisely this: I set
up the
namespaces unprivileged with child mount namespaces, but then
I ask
root to bind the userns and kill the process that created it
so I
have a permanent handle to enter the namespace by, so I
suspect
that when our current orchestration systems get more
sophisticated,
they might eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an
option
(just add a show_options entry to the superblock ops), but
the
problem is that although each namespace has a parent user_ns,
there's no way to get it without digging in the namespace
specific
structure.  Probably we should restructure to move it into
ns_common, then we could display it (and enforce all
namespaces
having owning user_ns) but it would be a


I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?


Um, yes, I don't believe I said they don't.  The problem I
thought you
were having is that there's no way of seeing what it is.

nsfs is the Namespace fileystem where bound namespaces appear to
a cat
of /proc/self/mounts.  It can display any information that's in
ns_common (the common core of namespaces) but the owning user_ns
pointer currently isn't in this structure.  Every user namespace
has a
pointer to it, but they're all privately embedded in the
individual
namespace specific structures.  What I was proposing was that
since
every current namespace has a pointer somewhere to the owning
user
namespace, we could abstract this out into ns_common so it's now
accessible to be displayed by nsfs, probably as a mount option.


James, I am not sure that I understood you correctly. We have one
file system for all namespace files, how we can show per-file
properties
in mount options. I think we can show all required information in
fdinfo. We open a namespaces file (/proc/pid/ns/N) and then read
/proc/pid/fdinfo/X for it.


Here is a proof-of-concept patch.

How it works:

In [1]: import os

In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)

In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
pos:0
flags:  010
mnt_id: 2
userns: 4026531837

In [4]: print "/proc/self/ns/user -> %s" %
os.readlink("/proc/self/ns/user")
/proc/self/ns/user -> user:[4026531837]


can't you just do

readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'

?

But what Michael was asking about was the parent user_ns of all the
other namespaces ...


Just to reiterate, what I'm interested in is the introspection use
case (but there's clearly several other interesting use cases here).
The idea is to be able to answer these questions

1. For each userns, what is the parent of that userns?

2. For each non-user namespace, what is the owning userns?

This enables us to understand the userns hierarchy, which
matters in terms of answering the question: what capabilities
does process X have in namespace Y?
   
Cheers,


Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread Michael Kerrisk (man-pages)

On 07/07/2016 09:17 PM, James Bottomley wrote:

On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:

On 7 July 2016 at 17:01, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm.  Probably best-effort based on the process hierarchy.  So
yeah you could probably get a tree into a state that would be
wrongly recreated. Create a new netns, bind mount it, exit;  Have
another task create a new user_ns, bind mount it, exit;  Third
task setns()s first to the new netns then to the new user_ns.  I
suspect criu will recreate that wrongly.


This is a bit pathological, and you have to be root to do it: so
root can set up a nesting hierarchy, bind it and destroy the pids
but I know of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I
have a permanent handle to enter the namespace by, so I suspect
that when our current orchestration systems get more sophisticated,
they might eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the
problem is that although each namespace has a parent user_ns,
there's no way to get it without digging in the namespace specific
structure.  Probably we should restructure to move it into
ns_common, then we could display it (and enforce all namespaces
having owning user_ns) but it would be a


I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?


Um, yes, I don't believe I said they don't.  The problem I thought you
were having is that there's no way of seeing what it is.


Your words "and enforce all namespaces having owning user_ns" were
what left me puzzled--it sounded to me that the implication was
that this is not "enforced" right now.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread Michael Kerrisk (man-pages)
On 7 July 2016 at 17:01, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
>> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
>> > Hi Serge,
>> >
>> > On 6 July 2016 at 16:13, Serge E. Hallyn <se...@hallyn.com> wrote:
>> > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man
>> > > -pages) wrote:
>> > > > [Rats! Doing now what I should have down to start with. Looping
>> > > > some lists and CRIU and other possibly relevant people into
>> > > > this conversation]
>> > > >
>> > > > Hi Eric,
>> > > >
>> > > > On 5 July 2016 at 23:47, Eric W. Biederman <
>> > > > ebied...@xmission.com> wrote:
>> > > > > "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com>
>> > > > > writes:
>> > > > >
>> > > > > > Hi Eric,
>> > > > > >
>> > > > > > I have a question. Is there any way currently to discover
>> > > > > > which user namespace a particular nonuser namespace is
>> > > > > > governed by? Maybe I am missing something, but there does
>> > > > > > not seem to be a way to do this. Also, can one discover
>> > > > > > which userns is the parent of a given userns? Again, I
>> > > > > > can't see a way to do this.
>> > > > > >
>> > > > > > The point here is introspecting so that a process might
>> > > > > > determine what its capabilities are when operating on some
>> > > > > > resource governed by a (nonuser) namespace.
>> > > > >
>> > > > > To the best of my knowledge that there is not an interface to
>> > > > > get that information.  It would be good to have such an
>> > > > > interface for no other reason than the CRIU folks are going
>> > > > > to need it at some point.  I am a bit surprised they have not
>> > > > > complained yet.
>> > >
>> > > I don't think they need it.  They do in fact have what they need.
>> > >   Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in
>> > > init_user_ns;  T1 spawned T1_1 in a new userns;  T2 spawned T2_1
>> > > which setns()d to T1_1's ns. There's some {handwave} uid mapping,
>> > > does not matter.
>> > >
>> > > At restart, it doesn't matter which task originally created the
>> > > new userns. criu knows T1_1 and T2_1 are in the same userns;  it
>> > > creates the userns, sets up the mapping, and T1_1 and T2_1
>> > > setns() to it.
>> >
>> > I'm missing something here. How does the parental relationships
>> > between the user namespaces get reconstructed? Those relationships
>> > will govern what capabilities a process will have in various user
>> > namespaces.
>
> Actually, you get the parent namespace from the process tree by
> tracking the user namespaces of the parent pids.   Currently non-root
> users can't bind the namespace, so the only way to keep a new user_ns
> around if you're not root is to keep the process around, so for
> multiply nested user namespaces you can usually build the user_ns
> hierarchy by looking at the process hierarchy.  Conversely, if the
> process is reparented to init, chances are that the user_ns is also
> parented to init_user_ns.

Yes, but "chances are" == this isn't robust.  PR_SET_CHILD_SUBREAPER
further complicates things.
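
The process-tree heuristic described in the quoted text can be sketched as follows. This is an illustration under stated assumptions, not how CRIU works: the inputs `ppid_of` and `userns_of` are assumed to have been collected beforehand (for example from the PPid field of /proc/PID/status and from the /proc/PID/ns/user link), and the function name is made up.

```python
def guess_userns_parents(ppid_of, userns_of):
    """Guess each user namespace's parent from the process hierarchy.

    For every process, walk up its ancestors until one is found in a
    different user namespace and record that namespace as a candidate
    parent.  Returns {userns: set of candidate parents}.
    """
    guesses = {}
    for pid, userns in userns_of.items():
        ancestor = ppid_of.get(pid)
        visited = set()
        while ancestor in userns_of and ancestor not in visited:
            visited.add(ancestor)
            if userns_of[ancestor] != userns:
                guesses.setdefault(userns, set()).add(userns_of[ancestor])
                break
            ancestor = ppid_of.get(ancestor)
    return guesses


# Hypothetical snapshot: pid 300 is in userns "B", created inside "A".
ppid_of = {1: 0, 100: 1, 200: 100, 300: 200}
userns_of = {1: "init", 100: "init", 200: "A", 300: "B"}
print(guess_userns_parents(ppid_of, userns_of))

# If pid 200 exits, pid 300 is reparented to init (or to a subreaper),
# and the same walk now suggests "init" as the parent of "B".
ppid_of_after = {1: 0, 100: 1, 300: 1}
userns_of_after = {1: "init", 100: "init", 300: "B"}
print(guess_userns_parents(ppid_of_after, userns_of_after))
```

The second snapshot is the kind of case where the heuristic gives a guess rather than an answer: the guess changes with the process tree, and a namespace kept alive only by a bind mount has no process to start from at all.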

By the way, is that really what happens? Do child user namespaces get
reparented to the grandparent ns if the parent ns disappears (i.e.,
ceases to have any members and no bind mounts)? I hadn't thought about
that scenario before. It may be worth documenting in
user_namespaces(7).

>> Hm.  Probably best-effort based on the process hierarchy.  So yeah
>> you could probably get a tree into a state that would be wrongly
>> recreated. Create a new netns, bind mount it, exit;  Have another
>> task create a new user_ns, bind mount it, exit;  Third task setns()s
>> first to the new netns then to the new user_ns.  I suspect criu will
>> recreate that wrongly.
>
> This is a bit pathological, and you have to be root to do it: so root
> can set up a nesting hierarchy, bind it and destroy the pids but I know
> of no current orchestration system which does this.
>
> Actually, I have to back pedal a bit: the way I currently set up
> architec

Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread Michael Kerrisk (man-pages)
Hi Serge,

On 6 July 2016 at 16:13, Serge E. Hallyn <se...@hallyn.com> wrote:
> On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
>> [Rats! Doing now what I should have down to start with. Looping some
>> lists and CRIU and other possibly relevant people into this
>> conversation]
>>
>> Hi Eric,
>>
>> On 5 July 2016 at 23:47, Eric W. Biederman <ebied...@xmission.com> wrote:
>> > "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:
>> >
>> >> Hi Eric,
>> >>
>> >> I have a question. Is there any way currently to discover which
>> >> user namespace a particular nonuser namespace is governed by?
>> >> Maybe I am missing something, but there does not seem to be a
>> >> way to do this. Also, can one discover which userns is the
>> >> parent of a given userns? Again, I can't see a way to do this.
>> >>
>> >> The point here is introspecting so that a process might determine
>> >> what its capabilities are when operating on some resource governed
>> >> by a (nonuser) namespace.
>> >
>> > To the best of my knowledge that there is not an interface to get that
>> > information.  It would be good to have such an interface for no other
>> > reason than the CRIU folks are going to need it at some point.  I am a
>> > bit surprised they have not complained yet.
>
> I don't think they need it.  They do in fact have what they need.  Assume
> you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
> spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
> There's some {handwave} uid mapping, does not matter.
>
> At restart, it doesn't matter which task originally created the new userns.
> criu knows T1_1 and T2_1 are in the same userns;  it creates the userns, sets
> up the mapping, and T1_1 and T2_1 setns() to it.

I'm missing something here. How do the parental relationships
between the user namespaces get reconstructed? Those relationships
will govern what capabilities a process will have in various user
namespaces.
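
The part that is already possible, establishing that T1_1 and T2_1 share a user namespace, can be sketched in Python as below. The sketch relies on the documented behaviour that two processes are in the same namespace when stat() on their /proc/PID/ns/ entries returns the same device and inode numbers; the function names and the injectable `identify` parameter are illustrative assumptions.

```python
import os
from collections import defaultdict


def ns_identity(pid, ns="user"):
    """Identify a process's namespace by the (st_dev, st_ino) of its ns file."""
    st = os.stat("/proc/%d/ns/%s" % (pid, ns))
    return (st.st_dev, st.st_ino)


def group_by_namespace(pids, identify=ns_identity):
    """Group process IDs by namespace identity."""
    groups = defaultdict(list)
    for pid in pids:
        groups[identify(pid)].append(pid)
    return dict(groups)
```

The result is a flat set of groups. Nothing in it says which group's namespace is the parent of which, and that is the information needed to reconstruct the hierarchy.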

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Introspecting userns relationships to other namespaces?

2016-07-06 Thread Michael Kerrisk (man-pages)
[Rats! Doing now what I should have done to start with. Looping some
lists and CRIU and other possibly relevant people into this
conversation]

Hi Eric,

On 5 July 2016 at 23:47, Eric W. Biederman <ebied...@xmission.com> wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:
>
>> Hi Eric,
>>
>> I have a question. Is there any way currently to discover which
>> user namespace a particular nonuser namespace is governed by?
>> Maybe I am missing something, but there does not seem to be a
>> way to do this. Also, can one discover which userns is the
>> parent of a given userns? Again, I can't see a way to do this.
>>
>> The point here is introspecting so that a process might determine
>> what its capabilities are when operating on some resource governed
>> by a (nonuser) namespace.
>
> To the best of my knowledge there is not an interface to get that
> information.  It would be good to have such an interface for no other
> reason than the CRIU folks are going to need it at some point.  I am a
> bit surprised they have not complained yet.
>
> That said in a normal use scenario I don't think that information is
> needed.
>
> Do you have a particular use case besides checkpoint/restart where this
> is useful?  That might help in coming up with a good userspace interface
> for this information.

So, I spend a moderate amount of time working with people to introduce
them to the namespaces infrastructure, and one topic that comes up now
and then is introspection/visualization tools. For example,
nowadays--thanks to the (bizarrely misnamed) NStgid and NSpid fields
in /proc/PID/status--it's possible to (and someone I was working with
did) write tools that introspect the PID namespace hierarchy to show
all of the processes and their PIDs in the various namespace
instances. It's a natural enough thing to want to do, when confronted
with the complexity of the namespaces.
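
As a rough sketch of the core of such a tool: the NSpid line in /proc/PID/status lists the process's ID in each PID namespace it is a member of, starting with the namespace of the procfs mount and moving inward. The helper name and the sample text are illustrative assumptions, not taken from any existing tool.

```python
def parse_nspid(status_text):
    """Return the NSpid values from the text of /proc/PID/status.

    The list is ordered from the outermost visible PID namespace to the
    innermost one; it is empty if the kernel does not provide the field.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


# Made-up excerpt for a process nested two PID namespaces deep.
sample_status = "Name:\tsleep\nNStgid:\t3077\t12\t1\nNSpid:\t3077\t12\t1\n"
nspid = parse_nspid(sample_status)
print("nesting depth:", len(nspid) - 1, "PIDs:", nspid)
```

Running this over every /proc/PID/status and grouping processes by the length and prefix of the list is one plausible way to draw the hierarchy.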

Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespace types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations
on a resource governed by namespace Y?
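
To illustrate why the parental relationships matter for that question, here is a toy model of the capability rules as I understand them from user_namespaces(7): a process has a capability in its own user namespace if the capability is in its effective set, it keeps that capability in all descendant namespaces, and a process in the parent namespace whose effective UID matches the namespace owner has all capabilities there. The class and function names are hypothetical, and the model ignores everything else the kernel checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNS:
    name: str
    parent: Optional["UserNS"] = None
    owner_euid: int = 0  # effective UID of the creator, as seen by the kernel


@dataclass(eq=False)
class Process:
    userns: UserNS
    euid: int
    effective_caps: frozenset = frozenset()


def has_capability(proc, cap, target):
    """Toy check: does proc hold cap with respect to user namespace target?"""
    ns = target
    while ns is not None:
        if ns is proc.userns:
            # Member of this namespace: the effective set decides.
            return cap in proc.effective_caps
        if ns.parent is proc.userns and ns.owner_euid == proc.euid:
            # Owner of a child namespace, seen from its parent.
            return True
        ns = ns.parent
    return False  # target is not proc's namespace or a descendant of it


init_ns = UserNS("init")
child = UserNS("child", parent=init_ns, owner_euid=1000)
grandchild = UserNS("grandchild", parent=child, owner_euid=1000)

owner = Process(init_ns, euid=1000)
stranger = Process(init_ns, euid=2000)
inner = Process(child, euid=1000, effective_caps=frozenset({"CAP_SYS_ADMIN"}))

print(has_capability(owner, "CAP_SYS_ADMIN", grandchild))
print(has_capability(inner, "CAP_SYS_ADMIN", init_ns))
```

For a resource governed by a nonuser namespace Y, `target` would be the user namespace that owns Y, which is exactly the association that cannot currently be discovered.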

Cheers,

Michael




-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Review of ptrace Yama ptrace_scope description

2016-06-28 Thread Michael Kerrisk (man-pages)

Hi Kees,

On 06/28/2016 10:55 PM, Kees Cook wrote:

On Mon, Jun 27, 2016 at 11:11 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

Hi Jann,


On 06/25/2016 04:30 PM, Jann Horn wrote:


On Sat, Jun 25, 2016 at 09:30:43AM +0200, Michael Kerrisk (man-pages)
wrote:


Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe
the Yama ptrace_scope file. I don't think I asked you for review
at the time, but in the light of other changes to the ptrace(2)
page, it occurred to me that it might be a good idea to ask you
to check the text below to see if anything is missing or could be
improved. Might you have a moment for that?

   /proc/sys/kernel/yama/ptrace_scope
   On systems with the Yama Linux Security Module (LSM)  installed
   (i.e.,  the  kernel  was configured with CONFIG_SECURITY_YAMA),
   the /proc/sys/kernel/yama/ptrace_scope  file  (available  since
   Linux  3.4)  can  be  used  to  restrict the ability to trace a
   process with ptrace(2) (and thus also the ability to use  tools
   such  as  strace(1) and gdb(1)).  The goal of such restrictions
   is to prevent attack escalation whereby a  compromised  process
   can  ptrace-attach  to  other  sensitive processes (e.g., a GPG
   agent or an SSH session) owned by the user  in  order  to  gain
   additional credentials and thus expand the scope of the attack.


Maybe clarify "additional credentials that may exist in memory only and thus..."


Done.



   More precisely, the Yama LSM limits two types of operations:

   *  Any   operation   that   performs   a   ptrace  access  mode
  PTRACE_MODE_ATTACH check—for  example,  ptrace()
  PTRACE_ATTACH.   (See the "Ptrace access mode checking" dis‐
  cussion above.)

   *  ptrace() PTRACE_TRACEME.

   A process that has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the follow‐
   ing values:

   0 ("classic ptrace permissions")
  No additional restrictions on  operations  that  perform
  PTRACE_MODE_ATTACH  checks  (beyond those imposed by the
  commoncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When   performing   an   operation   that   requires   a
  PTRACE_MODE_ATTACH  check, the calling process must have
  a predefined relationship with the target  process.   By
  default,  the predefined relationship is that the target
  process must be a child of the caller.

  A target process can employ the prctl(2)  PR_SET_PTRACER
  operation  to declare a different PID that is allowed to
  perform PTRACE_MODE_ATTACH  operations  on  the  target.
  See   the   kernel   source   file   Documentation/secu‐
  rity/Yama.txt for further details.

  The use of PTRACE_TRACEME is unchanged.



(namespaced) CAP_SYS_PTRACE is also sufficient here.


Both here and in the "admin-only attach" case, it is IMO important to
note that creating a user namespace effectively removes the Yama
protection because the owner of a namespace, when accessing its
contents from outside, is relatively capable.

This means that when a process tries to use namespaces to sandbox
itself, it inadvertently makes itself more accessible.

(This could probably be worked around in the kernel, but such a
workaround would likely not be default, but rather opt-in via a new
flag for clone() and unshare() or so.)



Thanks for catching this!

So I've made that section of text:

   A  process  that  has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the  following
   values:

   0 ("classic ptrace permissions")
  No  additional  restrictions  on  operations  that  perform
  PTRACE_MODE_ATTACH checks (beyond those imposed by the com‐
  moncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When performing an operation that requires a
  PTRACE_MODE_ATTACH check, the calling process  must  either
  have the CAP_SYS_PTRACE capability in the user namespace of
  the target process or it  have  a  predefined  relationship
  with  the target process.  By default, the predefined rela‐
  tionship is that the target process must be a child of  the
  caller.


More accurately, must be a descendant of the caller (grandchild is fine, etc).


Thanks, Fixed.
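
The "restricted ptrace" rule, with the descendant correction applied, can be sketched as a small model. This is not the kernel's code: the function names are made up, processes are plain PIDs with a parent map, and PR_SET_PTRACER is reduced to a single declared PID.

```python
def is_descendant(target, ancestor, parent_of):
    """Walk parent links upward from target; True if ancestor is reached."""
    seen = set()
    pid = parent_of.get(target)
    while pid is not None and pid not in seen:
        if pid == ancestor:
            return True
        seen.add(pid)
        pid = parent_of.get(pid)
    return False


def restricted_ptrace_allows(caller, target, parent_of,
                             has_cap_sys_ptrace=False, declared_ptracer=None):
    """Model of ptrace_scope == 1 for a PTRACE_MODE_ATTACH check."""
    if has_cap_sys_ptrace:
        # CAP_SYS_PTRACE in the user namespace of the target.
        return True
    if declared_ptracer == caller:
        # Target named the caller via prctl(PR_SET_PTRACER).
        return True
    # Default predefined relationship: target descends from caller.
    return is_descendant(target, caller, parent_of)


# Hypothetical process tree: 1 -> 2 -> 3 (3 is a grandchild of 1).
parent_of = {2: 1, 3: 2}
print(restricted_ptrace_allows(1, 3, parent_of))
print(restricted_ptrace_allows(3, 1, parent_of))
```

In this model a grandchild is attachable by its grandparent, while attaching "upward" needs either the capability or an explicit declaration by the target.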





  A  target  process  can  employ the prctl(2) PR_SET_PTRACER
  operation to declare a different PID 

Re: Review of ptrace Yama ptrace_scope description

2016-06-28 Thread Michael Kerrisk (man-pages)

Hi Jann,
...


So I've made that section of text:

   A  process  that  has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the  following
   values:

   0 ("classic ptrace permissions")
  No  additional  restrictions  on  operations  that  perform
  PTRACE_MODE_ATTACH checks (beyond those imposed by the com‐
  moncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When performing an operation that requires a
  PTRACE_MODE_ATTACH check, the calling process  must  either
  have the CAP_SYS_PTRACE capability in the user namespace of
  the target process or it  have  a  predefined  relationship
  with  the target process.


Nit: The grammar in this sentence seems wrong to me.
s/or it have/or it must have/?


Yep, thanks for catching that. Fixed now.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Review of ptrace Yama ptrace_scope description

2016-06-28 Thread Michael Kerrisk (man-pages)

Hi Jann,

On 06/25/2016 04:30 PM, Jann Horn wrote:

On Sat, Jun 25, 2016 at 09:30:43AM +0200, Michael Kerrisk (man-pages) wrote:

Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe
the Yama ptrace_scope file. I don't think I asked you for review
at the time, but in the light of other changes to the ptrace(2)
page, it occurred to me that it might be a good idea to ask you
to check the text below to see if anything is missing or could be
improved. Might you have a moment for that?

   /proc/sys/kernel/yama/ptrace_scope
   On systems with the Yama Linux Security Module (LSM)  installed
   (i.e.,  the  kernel  was configured with CONFIG_SECURITY_YAMA),
   the /proc/sys/kernel/yama/ptrace_scope  file  (available  since
   Linux  3.4)  can  be  used  to  restrict the ability to trace a
   process with ptrace(2) (and thus also the ability to use  tools
   such  as  strace(1) and gdb(1)).  The goal of such restrictions
   is to prevent attack escalation whereby a  compromised  process
   can  ptrace-attach  to  other  sensitive processes (e.g., a GPG
   agent or an SSH session) owned by the user  in  order  to  gain
   additional credentials and thus expand the scope of the attack.

   More precisely, the Yama LSM limits two types of operations:

   *  Any   operation   that   performs   a   ptrace  access  mode
  PTRACE_MODE_ATTACH check—for  example,  ptrace()
  PTRACE_ATTACH.   (See the "Ptrace access mode checking" dis‐
  cussion above.)

   *  ptrace() PTRACE_TRACEME.

   A process that has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the follow‐
   ing values:

   0 ("classic ptrace permissions")
  No additional restrictions on  operations  that  perform
  PTRACE_MODE_ATTACH  checks  (beyond those imposed by the
  commoncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When   performing   an   operation   that   requires   a
  PTRACE_MODE_ATTACH  check, the calling process must have
  a predefined relationship with the target  process.   By
  default,  the predefined relationship is that the target
  process must be a child of the caller.

  A target process can employ the prctl(2)  PR_SET_PTRACER
  operation  to declare a different PID that is allowed to
  perform PTRACE_MODE_ATTACH  operations  on  the  target.
  See   the   kernel   source   file   Documentation/secu‐
  rity/Yama.txt for further details.

  The use of PTRACE_TRACEME is unchanged.


(namespaced) CAP_SYS_PTRACE is also sufficient here.


Both here and in the "admin-only attach" case, it is IMO important to
note that creating a user namespace effectively removes the Yama
protection because the owner of a namespace, when accessing its
contents from outside, is relatively capable.

This means that when a process tries to use namespaces to sandbox
itself, it inadvertently makes itself more accessible.

(This could probably be worked around in the kernel, but such a
workaround would likely not be default, but rather opt-in via a new
flag for clone() and unshare() or so.)


Thanks for catching this!

So I've made that section of text:

   A  process  that  has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the  following
   values:

   0 ("classic ptrace permissions")
  No  additional  restrictions  on  operations  that  perform
  PTRACE_MODE_ATTACH checks (beyond those imposed by the com‐
  moncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When performing an operation that requires a
  PTRACE_MODE_ATTACH check, the calling process  must  either
  have the CAP_SYS_PTRACE capability in the user namespace of
  the target process or it  have  a  predefined  relationship
  with  the target process.  By default, the predefined rela‐
  tionship is that the target process must be a child of  the
  caller.

  A  target  process  can  employ the prctl(2) PR_SET_PTRACER
  operation to declare a different PID  that  is  allowed  to
  perform  PTRACE_MODE_ATTACH  operations on the target.  See
  the kernel source file Documentation/security/Yama.txt  for
  further details.

  The use of PTRACE_TRACEME is unchanged.

   2 ("admin-only attach")
  Only  processes  with  the CAP_SYS_PTRACE capability 
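
For readers who want to check the setting on a live system, a minimal sketch follows. It only names the values described in the text above (0, 1 and 2); the helper names are made up, and the file is absent on kernels built without Yama.

```python
from pathlib import Path

YAMA_PTRACE_SCOPE = Path("/proc/sys/kernel/yama/ptrace_scope")

# Names as used in the man page text quoted above.
SCOPE_NAMES = {
    0: "classic ptrace permissions",
    1: "restricted ptrace",
    2: "admin-only attach",
}


def describe_ptrace_scope(raw):
    """Map the file's contents (e.g. "1\n") to a descriptive name."""
    value = int(raw.strip())
    return SCOPE_NAMES.get(value, f"value {value} (not covered in the text above)")


def current_ptrace_scope(path=YAMA_PTRACE_SCOPE):
    """Return a description of the live setting, or None if unreadable."""
    try:
        return describe_ptrace_scope(path.read_text())
    except OSError:
        return None  # no Yama, no procfs, or no permission


print(current_ptrace_scope())
```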

Re: [PATCH v2 2/2] namespaces: add transparent user namespaces

2016-06-26 Thread Michael Kerrisk
Hi Jann,

Patches such as this really should CC linux-api@ (added).

On Sat, Jun 25, 2016 at 2:23 AM, Jann Horn  wrote:
> This allows the admin of a user namespace to mark the namespace as
> transparent. All other namespaces, by default, are opaque.
>
> While the current behavior of user namespaces is appropriate for use in
> containers, there are many programs that only use user namespaces because
> doing so enables them to do other things (e.g. unsharing the mount or
> network namespace) that require namespaced capabilities. For them, the
> inability to see the real UIDs and GIDs of things from inside the user
> namespace can be very annoying.
>
> In a transparent namespace, all UIDs and GIDs that are mapped into its
> first opaque ancestor are visible and are not remapped. This means that if
> a process e.g. stat()s the real root directory in a namespace, it will
> still see it as owned by UID 0.
>
> Traditionally, any UID or GID that was visible in a user namespace was also
> mapped into the namespace, giving the namespace admin full access to it.
> This patch introduces a distinction: In a transparent namespace, UIDs and
> GIDs can be visible without being mapped. Non-mapped, visible UIDs can be
> passed from the kernel to userspace, but userspace can't send them back to
> the kernel.

Can you explain "can't send them back to the kernel" in more detail?
(Some examples of what is and isn't possible would be helpful.)

> In order to be able to fully use specific UIDs/GIDs and gain
> privileges over them, mappings need to be set up in the usual way -
> however, to avoid aliasing problems, only identity mappings are permitted.
>
> v2:
> Ensure that all relevant from_k[ug]id callers show up in the patch.
> _transparent would be more verbose than _tp, but considering the line
> length rule, that's just too long.
>
> Yes, this makes the patch rather large.
>
> Behavior should be the same as in v1, except that I'm not touching orangefs
> in this patch because every single use of from_k[ug]id in it is wrong in
> some way. (Thanks for making me reread all that stuff, Eric.) I'll write a
> separate patch or at least report the issue with more detail later.
>
> (Also, the handling of user namespaces when dealing with signals is
> super-ugly and kind of incorrect. That should probably be cleaned up.)

I'm curious about this detail: can you say some more about the issues here?

> posix_acl_to_xattr would have changed behavior in the v1 patch, but isn't
> changed here. Because it's only used with init_user_ns, that won't change
> user-visible behavior relative to v1.
>
> This patch was compile-tested with allyesconfig. I also ran a VM with this
> patch applied and checked that it still works, but that probably doesn't
> mean much.

One of the things notably lacking from this commit message is any sort
of description of the user-space-API changes that it makes. I presume
it's a matter of some /proc files. Could you explain the changes (and
add that detail in any further commit message)?

Thanks,

Michael

> Signed-off-by: Jann Horn 
> ---
>  arch/alpha/kernel/osf_sys.c   |   4 +-
>  arch/arm/kernel/sys_oabi-compat.c |   4 +-
>  arch/ia64/kernel/signal.c |   4 +-
>  arch/s390/kernel/compat_linux.c   |  26 +++---
>  arch/sparc/kernel/sys_sparc32.c   |   4 +-
>  arch/x86/ia32/sys_ia32.c  |   4 +-
>  drivers/android/binder.c  |   2 +-
>  drivers/gpu/drm/drm_info.c|   2 +-
>  drivers/gpu/drm/drm_ioctl.c   |   2 +-
>  drivers/net/tun.c |   4 +-
>  fs/autofs4/dev-ioctl.c|   4 +-
>  fs/autofs4/waitq.c|   4 +-
>  fs/binfmt_elf.c   |  12 +--
>  fs/binfmt_elf_fdpic.c |  12 +--
>  fs/compat.c   |   4 +-
>  fs/fcntl.c|   4 +-
>  fs/ncpfs/ioctl.c  |  12 +--
>  fs/posix_acl.c|  11 ++-
>  fs/proc/array.c   |  18 ++--
>  fs/proc/base.c|  30 +--
>  fs/quota/kqid.c   |  12 ++-
>  fs/stat.c |  12 +--
>  include/linux/uidgid.h|  24 +++--
>  include/linux/user_namespace.h|   4 +
>  include/net/scm.h |   4 +-
>  ipc/mqueue.c  |   2 +-
>  ipc/msg.c |   8 +-
>  ipc/sem.c |   8 +-
>  ipc/shm.c |   8 +-
>  ipc/util.c|   8 +-
>  kernel/acct.c |   4 +-
>  kernel/exit.c |   6 +-
>  kernel/groups.c   |   2 +-
>  kernel/signal.c   |  16 ++--
>  kernel/sys.c  |  24 ++---
>  kernel/trace/trace.c  |   2 +-
>  kernel/tsacct.c   |   4 +-
>  kernel/uid16.c|  22 ++---
>  kernel/user.c |   1 +
>  kernel/user_namespace.c   | 178 
> +++---
>  

Review of ptrace Yama ptrace_scope description

2016-06-25 Thread Michael Kerrisk (man-pages)

Hi Kees,

So, last year, I added some documentation to ptrace(2) to describe
the Yama ptrace_scope file. I don't think I asked you for review
at the time, but in the light of other changes to the ptrace(2)
page, it occurred to me that it might be a good idea to ask you
to check the text below to see if anything is missing or could be
improved. Might you have a moment for that?

   /proc/sys/kernel/yama/ptrace_scope
   On systems with the Yama Linux Security Module (LSM)  installed
   (i.e.,  the  kernel  was configured with CONFIG_SECURITY_YAMA),
   the /proc/sys/kernel/yama/ptrace_scope  file  (available  since
   Linux  3.4)  can  be  used  to  restrict the ability to trace a
   process with ptrace(2) (and thus also the ability to use  tools
   such  as  strace(1) and gdb(1)).  The goal of such restrictions
   is to prevent attack escalation whereby a  compromised  process
   can  ptrace-attach  to  other  sensitive processes (e.g., a GPG
   agent or an SSH session) owned by the user  in  order  to  gain
   additional credentials and thus expand the scope of the attack.

   More precisely, the Yama LSM limits two types of operations:

   *  Any   operation   that   performs   a   ptrace  access  mode
  PTRACE_MODE_ATTACH check—for  example,  ptrace()
  PTRACE_ATTACH.   (See the "Ptrace access mode checking" dis‐
  cussion above.)

   *  ptrace() PTRACE_TRACEME.

   A process that has the CAP_SYS_PTRACE capability can update the
   /proc/sys/kernel/yama/ptrace_scope file with one of the follow‐
   ing values:

   0 ("classic ptrace permissions")
  No additional restrictions on  operations  that  perform
  PTRACE_MODE_ATTACH  checks  (beyond those imposed by the
  commoncap and other LSMs).

  The use of PTRACE_TRACEME is unchanged.

   1 ("restricted ptrace") [default value]
  When   performing   an   operation   that   requires   a
  PTRACE_MODE_ATTACH  check, the calling process must have
  a predefined relationship with the target  process.   By
  default,  the predefined relationship is that the target
  process must be a child of the caller.

  A target process can employ the prctl(2)  PR_SET_PTRACER
  operation  to declare a different PID that is allowed to
  perform PTRACE_MODE_ATTACH  operations  on  the  target.
  See   the   kernel   source   file   Documentation/secu‐
  rity/Yama.txt for further details.

  The use of PTRACE_TRACEME is unchanged.

   2 ("admin-only attach")
  Only processes with the  CAP_SYS_PTRACE  capability  may
  perform  PTRACE_MODE_ATTACH operations or trace children
  that employ PTRACE_TRACEME.

   3 ("no attach")
  No process may perform PTRACE_MODE_ATTACH operations  or
  trace children that employ PTRACE_TRACEME.

  Once  this value has been written to the file, it cannot
      be changed.
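
To make the four values concrete, here is a minimal user-space sketch in C
that reads the file and maps the number to the labels used in the text
above. The helper names (yama_scope_label, yama_read_scope) are made up for
this illustration; only the path and the labels come from the draft text.

```c
#include <stdio.h>

/* Path taken from the man-page text above. */
#define YAMA_PTRACE_SCOPE_PATH "/proc/sys/kernel/yama/ptrace_scope"

/*
 * Map a ptrace_scope value to the label used in the man-page text.
 * Returns NULL for values the text does not describe.
 * (Hypothetical helper, not part of any library.)
 */
const char *yama_scope_label(int scope)
{
	switch (scope) {
	case 0: return "classic ptrace permissions";
	case 1: return "restricted ptrace";
	case 2: return "admin-only attach";
	case 3: return "no attach";
	default: return NULL;
	}
}

/*
 * Read the current value from 'path'. Returns -1 if the file cannot be
 * opened (for example, a kernel built without CONFIG_SECURITY_YAMA) or
 * if its contents do not parse as an integer.
 */
int yama_read_scope(const char *path)
{
	FILE *f = fopen(path, "r");
	int scope;

	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &scope) != 1)
		scope = -1;
	fclose(f);
	return scope;
}
```

A caller would typically do yama_read_scope(YAMA_PTRACE_SCOPE_PATH) and
treat -1 as "Yama not available" before looking up the label.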

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-25 Thread Michael Kerrisk (man-pages)

On 06/24/2016 05:18 PM, Casey Schaufler wrote:



On 6/24/2016 1:40 AM, Michael Kerrisk (man-pages) wrote:

On 06/22/2016 11:11 PM, Kees Cook wrote:

On Wed, Jun 22, 2016 at 12:21 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

On 06/21/2016 10:55 PM, Jann Horn wrote:

On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages)
wrote:

   5.  The  kernel LSM security_ptrace_access_check() interface is
   invoked to see if ptrace access is permitted.  The  results
   depend on the LSM.  The implementation of this interface in
   the default LSM performs the following steps:



For people who are unaware of how the LSM API works, it might be good to
clarify that the commoncap LSM is *always* invoked; otherwise, it might
give the impression that using another LSM would replace it.



As we can see, I am one of those who are unaware of how the LSM API
works :-/.


(Also, are there other documents that refer to it as "default LSM"? I
think that that term is slightly confusing.)



No, that's a terminological confusion of my own making. Fixed now.

I changed this text to:

   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations), require so-called "ptrace access mode permissions"
   which are gated by any enabled Linux Security Module (LSMs)—for
   example,  SELinux,  Yama, or Smack—and by the commoncap LSM
   (which is always invoked).  Prior to  Linux  2.6.27,  all  such
   checks  were  of a single type.  Since Linux 2.6.27, two access
   mode levels are distinguished:

BTW, can you point me at the piece(s) of kernel code that show that
"commoncap" is always invoked in addition to any other LSM that has
been installed?


It's not entirely obvious, but the bottom of security/commoncap.c shows:

#ifdef CONFIG_SECURITY

struct security_hook_list capability_hooks[] = {
	LSM_HOOK_INIT(capable, cap_capable),
	...
};

void __init capability_add_hooks(void)
{
	security_add_hooks(capability_hooks, ARRAY_SIZE(capability_hooks));
}

#endif

And security/security.c shows the initialization order of the LSMs:

int __init security_init(void)
{
	pr_info("Security Framework initialized\n");

	/*
	 * Load minor LSMs, with the capability module always first.
	 */
	capability_add_hooks();
	yama_add_hooks();
	loadpin_add_hooks();

	/*
	 * Load all the remaining security modules.
	 */
	do_security_initcalls();

	return 0;
}


So, I just want to check my understanding of a couple of points:

1. The commoncap LSM is invoked first, and if it denies access,
   then no further LSM is/needs to be called.


Yes. The LSM infrastructure is "bail on fail".



2. Is it the case that only one of the other LSMs (SELinux, Yama,
   AppArmor, etc.) is invoked, or can more than one be invoked?
   I thought only one is invoked, but perhaps I am out of date
   in my understanding.


All registered modules are invoked, but only one "major"
module can be registered. The "minor" modules show up in
security_init, while the majors come in via do_security_initcalls.

I am in the process of messing that all up with patches
allowing multiple major modules. Stay tuned.
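
The "bail on fail" behaviour described above can be modelled in user space
with a few lines of C. This is only a sketch of the idea (call every
registered hook in registration order, stop at the first denial); the
types, function names, and toy policies below are invented for the
illustration and are not the kernel's actual data structures.

```c
#include <stddef.h>

/* Hypothetical stand-in for one registered LSM hook. */
typedef int (*access_hook_fn)(int request);

struct hook_entry {
	const char *lsm_name;	/* label only, e.g. "commoncap" */
	access_hook_fn check;	/* returns 0 to allow, nonzero to deny */
};

/*
 * Call the hooks in registration order and stop at the first denial.
 * On denial, *denied_by (if non-NULL) is set to the refusing entry's
 * name and that hook's return value is passed back to the caller.
 */
int call_hooks_bail_on_fail(const struct hook_entry *hooks, size_t n,
			    int request, const char **denied_by)
{
	for (size_t i = 0; i < n; i++) {
		int rc = hooks[i].check(request);

		if (rc != 0) {
			if (denied_by != NULL)
				*denied_by = hooks[i].lsm_name;
			return rc;
		}
	}
	return 0;
}

/* Toy policies for illustration only; they are not real LSM logic. */
int toy_commoncap(int request) { return request < 0 ? -1 : 0; }
int toy_yama(int request) { return request > 100 ? -1 : 0; }
```

With an array such as { {"commoncap", toy_commoncap}, {"yama", toy_yama} },
a request refused by the first entry never reaches the second, which
mirrors point 1 above.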


Thanks for the info, Casey.

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-24 Thread Michael Kerrisk (man-pages)

On 06/24/2016 11:52 AM, Thomas Gleixner wrote:

On Fri, 24 Jun 2016, Michael Kerrisk (man-pages) wrote:

By the way, I just realized something that wasn't initially obvious
to me, and documented it in the futex(2) man page:

  Note:  for  FUTEX_WAIT,  timeout is interpreted as a
  relative value.  This differs from other futex oper‐
  ations,  where timeout is interpreted as an absolute
  value.  To obtain the equivalent of FUTEX_WAIT  with
  an  absolute  timeout, employ FUTEX_WAIT_BITSET with
  val3 specified as FUTEX_BITSET_MATCH_ANY.

Okay?


Yes.
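
For reference, the difference described in that note can be sketched in C
roughly as follows. The wrapper names are made up for this example; the
constants and the futex system call itself are the real ones from
<linux/futex.h>. The sketch assumes the default clock for the absolute
timeout (CLOCK_MONOTONIC, since FUTEX_CLOCK_REALTIME is not set).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* FUTEX_WAIT: block while *uaddr == expected; the timeout is RELATIVE. */
long futex_wait_relative(uint32_t *uaddr, uint32_t expected,
			 const struct timespec *rel_timeout)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
		       rel_timeout, NULL, 0);
}

/*
 * FUTEX_WAIT_BITSET with FUTEX_BITSET_MATCH_ANY: the same wait, but the
 * timeout is an ABSOLUTE time, measured here against CLOCK_MONOTONIC.
 */
long futex_wait_absolute(uint32_t *uaddr, uint32_t expected,
			 const struct timespec *abs_deadline)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET, expected,
		       abs_deadline, NULL, FUTEX_BITSET_MATCH_ANY);
}

/* Helper: the current CLOCK_MONOTONIC time plus 'ms' milliseconds. */
struct timespec monotonic_deadline_ms(long ms)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	ts.tv_sec += ms / 1000;
	ts.tv_nsec += (ms % 1000) * 1000000L;
	if (ts.tv_nsec >= 1000000000L) {
		ts.tv_sec += 1;
		ts.tv_nsec -= 1000000000L;
	}
	return ts;
}
```

Passing a relative struct timespec to futex_wait_absolute() would describe
a point in the distant past and so time out immediately, which is the
kind of mistake the note is meant to head off.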


Thanks, Thomas.

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-24 Thread Michael Kerrisk (man-pages)

Hi Eric,

On 06/23/2016 09:04 PM, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


Hi Eric,

On 06/21/2016 09:55 PM, Eric W. Biederman wrote:

Hmm.

When I gave this level of detail about the user namespace permission
checks you gave me some flack, because it was not particularly
comprehensible to the end users.  I think you deserve the same feedback.

How do we say this in a way that describes a useful way to
think about it?  I read this and I know a lot of what is going on and my
mind goes numb.

How about something like this:

   If the caller's uid and gid are the same as a process's uids and gids
   and the process is configured to allow core dumps (aka it was never
   setuid or setgid) then the caller is allowed to ptrace a process.

   Otherwise the caller must have CAP_SYS_PTRACE.

   Linux security modules impose additional restrictions.

   For consistency, access to various process attributes is guarded with
   the same security checks as the ptrace system call itself, as they are
   all methods to get information about a process.

We certainly need something that gives a high level view so people
reading the man page can know what to expect.   If you get down into the
weeds we run the danger of people beginning to think they can depend
upon bugs in the implementation.


Thanks for the feedback, but I think more detail is required than you
suggest. (And I added all of that detail somewhat reluctantly.)
See my other replies for my rationale.


What I saw badly missing from your description is not the level of
detail but bringing things into a form that ordinary mortals can
understand.

For an explanation to be clear I think we very much need the high level
overview first.  Then we can expand that description with the very
detailed view.

I very much think we need to describe things in such a way that people
understand the principles behind the permission checks, and not just
have the documentation echo the code, so that people can know what weird
things LSMs like yama are likely to do, and how these checks are likely
to evolve in the future.


So, I completely agree with you, and I agree that this could be better.
At first, I understood your meaning to be that I should avoid all of the
detail, and just limit the man page to some very high level text as
you proposed. So, I think it's worth prefixing the details with some
attempt at a high-level picture. How about this as an introductory
paragraph:

   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations),  require  so-called  "ptrace  access mode" checks,
   whose outcome determines whether an operation is permitted (or,
   in  a  few cases, causes a "read" operation to return sanitized
   data).  These checks are performed in cases where  one  process
   can  inspect sensitive information about, or in some cases mod‐
   ify the state of, another process.  The  checks  are  based  on
   factors  such  as  the  credentials and capabilities of the two
   processes, whether or not the "target" process is dumpable, and
   the  results  of checks performed by any enabled Linux Security
   Module (LSM)—for example, SELinux, Yama, or  Smack—and  by  the
   commoncap LSM (which is always invoked).

?
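
As a rough illustration of the high-level picture only, the rule of thumb
discussed in this thread (matching IDs plus a dumpable target, otherwise
CAP_SYS_PTRACE, with LSMs able to restrict further) could be written as
the C sketch below. The struct and function are hypothetical and heavily
simplified; the real kernel check considers more credentials and cases
than this.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, simplified view of a process for this sketch. */
struct proc_view {
	uid_t uid;
	gid_t gid;
	bool dumpable;		/* process allows core dumps */
	bool cap_sys_ptrace;	/* process holds CAP_SYS_PTRACE */
};

/*
 * High-level rule only: the caller passes if its IDs match the target's
 * and the target is dumpable, or else if the caller has CAP_SYS_PTRACE.
 * 'lsm_allows' stands in for the combined verdict of any enabled LSMs,
 * which can only restrict access further.
 */
bool ptrace_access_sketch(const struct proc_view *caller,
			  const struct proc_view *target,
			  bool lsm_allows)
{
	bool ids_match = caller->uid == target->uid &&
			 caller->gid == target->gid;
	bool base_ok = (ids_match && target->dumpable) ||
		       caller->cap_sys_ptrace;

	return base_ok && lsm_allows;
}
```

The point of the sketch is the shape of the decision, not its details: a
credential comparison, a capability override, and an LSM veto on top.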


Because one thing is clear to me.  The evolution of these details is
clearly not done, and will continue to change in the future.


Maybe people will even write man page patches when that happens :-).

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-24 Thread Michael Kerrisk (man-pages)

On 06/22/2016 11:11 PM, Kees Cook wrote:

On Wed, Jun 22, 2016 at 12:21 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

On 06/21/2016 10:55 PM, Jann Horn wrote:

On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages)
wrote:

   5.  The  kernel LSM security_ptrace_access_check() interface is
   invoked to see if ptrace access is permitted.  The  results
   depend on the LSM.  The implementation of this interface in
   the default LSM performs the following steps:



For people who are unaware of how the LSM API works, it might be good to
clarify that the commoncap LSM is *always* invoked; otherwise, it might
give the impression that using another LSM would replace it.



As we can see, I am one of those who are unaware of how the LSM API
works :-/.


(Also, are there other documents that refer to it as "default LSM"? I
think that that term is slightly confusing.)



No, that's a terminological confusion of my own making. Fixed now.

I changed this text to:

   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations), require so-called "ptrace access mode permissions"
   which are gated by any enabled Linux Security Module (LSMs)—for
   example,  SELinux,  Yama, or Smack—and by the commoncap LSM
   (which is always invoked).  Prior to  Linux  2.6.27,  all  such
   checks  were  of a single type.  Since Linux 2.6.27, two access
   mode levels are distinguished:

BTW, can you point me at the piece(s) of kernel code that show that
"commoncap" is always invoked in addition to any other LSM that has
been installed?


It's not entirely obvious, but the bottom of security/commoncap.c shows:

#ifdef CONFIG_SECURITY

struct security_hook_list capability_hooks[] = {
	LSM_HOOK_INIT(capable, cap_capable),
	...
};

void __init capability_add_hooks(void)
{
	security_add_hooks(capability_hooks, ARRAY_SIZE(capability_hooks));
}

#endif

And security/security.c shows the initialization order of the LSMs:

int __init security_init(void)
{
	pr_info("Security Framework initialized\n");

	/*
	 * Load minor LSMs, with the capability module always first.
	 */
	capability_add_hooks();
	yama_add_hooks();
	loadpin_add_hooks();

	/*
	 * Load all the remaining security modules.
	 */
	do_security_initcalls();

	return 0;
}


So, I just want to check my understanding of a couple of points:

1. The commoncap LSM is invoked first, and if it denies access,
   then no further LSM is/needs to be called.

2. Is it the case that only one of the other LSMs (SELinux, Yama,
   AppArmor, etc.) is invoked, or can more than one be invoked?
   I thought only one is invoked, but perhaps I am out of date
   in my understanding.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-24 Thread Michael Kerrisk (man-pages)

Stephen,

On 06/23/2016 08:05 PM, Stephen Smalley wrote:

On 06/21/2016 05:41 AM, Michael Kerrisk (man-pages) wrote:

Hi Jann, Stephen, et al.

Jann, since you recently committed a patch in this area, and Stephen,
since you committed 006ebb40d3d much further back in time, I wonder if
you might help me by reviewing the text below that I propose to add to
the ptrace(2) man page, in order to document "ptrace access mode
checking" that is performed in various parts of the kernel-user-space
interface. Of course, I welcome input from anyone else as well.

Here's the new ptrace(2) text. Any comments, technical or terminological
fixes, other improvements, etc. are welcome.

[[
   Ptrace access mode checking
   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations), require so-called "ptrace access mode permissions"
   which are gated  by  Linux  Security  Modules  (LSMs)  such  as
   SELinux,  Yama,  Smack,  or  the  default  LSM.  Prior to Linux
   2.6.27, all such checks were of a  single  type.   Since  Linux
   2.6.27, two access mode levels are distinguished:

   PTRACE_MODE_READ
  For  "read" operations or other operations that are less
  dangerous, such as: get_robust_list(2); kcmp(2); reading
  /proc/[pid]/auxv, /proc/[pid]/environ, or
  /proc/[pid]/stat; or readlink(2) of  a  /proc/[pid]/ns/*
  file.

   PTRACE_MODE_ATTACH
  For  "write"  operations,  or  other operations that are
  more dangerous, such as: ptrace attaching
  (PTRACE_ATTACH) to another process or calling
  process_vm_writev(2).   (PTRACE_MODE_ATTACH  was  effec‐
  tively the default before Linux 2.6.27.)


That was the intent when the distinction was introduced, but it doesn't
appear to have been properly maintained, e.g. there is now a common
helper lock_trace() that is used for
/proc/pid/{stack,syscall,personality} but checks PTRACE_MODE_ATTACH, and
PTRACE_MODE_ATTACH is also used in timerslack_ns_write/show().  Likely
should review and make them consistent.  There was also some debate
about proper handling of /proc/pid/fd.  Arguably that one might belong
back in the _ATTACH camp.


Thanks for the background info.


   Since  Linux  4.5, the above access mode checks may be combined
   (ORed) with one of the following modifiers:

   PTRACE_MODE_FSCREDS
  Use the caller's filesystem UID  and  GID  (see  creden‐
  tials(7)) or effective capabilities for LSM checks.

   PTRACE_MODE_REALCREDS
  Use the caller's real UID and GID or permitted capabili‐
  ties for LSM checks.  This was effectively  the  default
  before Linux 4.5.

   Because  combining  one of the credential modifiers with one of
   the aforementioned access modes is  typical,  some  macros  are
   defined in the kernel sources for the combinations:

   PTRACE_MODE_READ_FSCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_READ_REALCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

   PTRACE_MODE_ATTACH_FSCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_ATTACH_REALCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

   One further modifier can be ORed with the access mode:

   PTRACE_MODE_NOAUDIT (since Linux 3.3)
  Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]


Some ptrace access mode checks, such as checks when reading
/proc/pid/stat, merely cause the output to be filtered/sanitized rather
than an error to be returned to the caller.  In these cases, accessing
the file is not a security violation and there is no reason to generate
a security audit record.  This modifier suppresses the generation of
such an audit record for the particular access check.


Thanks, I've added that text to the man page more or less as you
gave it here.
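
For readers who want to see how the flags in the proposed text compose, here is a minimal sketch in C. The numeric bit values are illustrative assumptions (the real definitions live in the kernel's include/linux/ptrace.h and are not exposed to user space), and mode_has_one_cred_modifier() is a hypothetical helper that only models the rule that an access mode is ORed with one of the two credential modifiers.

```c
#include <stdbool.h>

/* Illustrative bit values only; these constants are kernel-internal. */
#define PTRACE_MODE_READ       0x01
#define PTRACE_MODE_ATTACH     0x02
#define PTRACE_MODE_NOAUDIT    0x04
#define PTRACE_MODE_FSCREDS    0x08
#define PTRACE_MODE_REALCREDS  0x10

/* Convenience combinations, as listed in the proposed man page text. */
#define PTRACE_MODE_READ_FSCREDS     (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_READ_REALCREDS   (PTRACE_MODE_READ | PTRACE_MODE_REALCREDS)
#define PTRACE_MODE_ATTACH_FSCREDS   (PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_ATTACH_REALCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS)

/* Hypothetical helper: true if exactly one credential modifier is set. */
bool mode_has_one_cred_modifier(unsigned int mode)
{
    unsigned int creds = mode & (PTRACE_MODE_FSCREDS | PTRACE_MODE_REALCREDS);

    return creds == PTRACE_MODE_FSCREDS || creds == PTRACE_MODE_REALCREDS;
}
```

PTRACE_MODE_NOAUDIT can be ORed on top of any of the four combinations without affecting the result of the helper.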

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-24 Thread Michael Kerrisk (man-pages)

On 06/23/2016 08:56 PM, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


Hi Oleg,

On 06/22/2016 11:51 PM, Oleg Nesterov wrote:

On 06/21, Eric W. Biederman wrote:


Adding Oleg just because he seems to do most of the ptrace related
maintenance these days.


so I have to admit that I never even tried to actually understand
ptrace_may_access ;)


We certainly need something that gives a high level view so people
reading the man page can know what to expect.   If you get down into the
weeds we run the danger of people beginning to think they can depend
upon bugs in the implementation.


Personally I agree. I think "man ptrace" shouldn't tell too much
about kernel internals.


See my other replies on this topic. Somehow, we need a way of
describing the behavior that user-space sees. I think it's
inevitable that that means talking about what's going on
"under the hood".

Regarding Eric's point that "we run the danger of people beginning
to think they can depend upon bugs in the implementation": when it
comes to breaking the ABI, the presence or absence of documentation
doesn't save us on that point (Linus has a few times made his position
with respect to documentation clear).


Which is interesting in this respect as a bug in the implementation
that is a security issue can and will be changed, even if userspace
breaks.  Breaking userspace is not desirable but when there is no other
reasonable choice it will happen.


Yes, good point.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-24 Thread Michael Kerrisk (man-pages)

On 06/23/2016 09:53 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 08:35:15PM +0200, Michael Kerrisk (man-pages) wrote:

Hi Darren,

On 06/23/2016 06:16 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:

On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:

On 06/23/2016 09:18 AM, Thomas Gleixner wrote:
Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner <t...@linutronix.de> wrote:

On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:

And that universe would love to have your documentation of
FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),


I give you almost the full treatment, but I leave REQUEUE_PI to Darren
and FUTEX_WAKE_OP to Jakub. :)
[...]
FUTEX_CLOCK_REALTIME

This option bit can be ored on the futex ops FUTEX_WAIT_BITSET
and FUTEX_WAIT_REQUEUE_PI

If set the kernel treats the user space supplied timeout as
absolute time based on CLOCK_REALTIME.

If not set the kernel treats the user space supplied timeout
as relative time.

Unfortunately, I should have checked the code more carefully...


Me too :)


Seems to be going around...




Looking more carefully at the code, I understand that the situation
is the following:

FUTEX_LOCK_PI
Always uses CLOCK_REALTIME
'timeout' is absolute


Yes.


FUTEX_WAIT_REQUEUE_PI
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute


Yes


FUTEX_WAIT_BITSET
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute


Yes


FUTEX_WAIT
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is relative


Yes.


I've amended the man page to describe those details.


OK, that confirms my question, timeout interpretation as relative or absolute is
based on the op code, not the CLOCK flag.
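
The table confirmed above can be written down as a small lookup, which makes the "op code decides relative vs absolute, flag decides the clock" rule explicit. This is only a sketch of the semantics as stated in this thread, not a statement about what any particular kernel version does; struct futex_timeout_semantics and futex_describe_timeout() are hypothetical names, while the FUTEX_* constants come from <linux/futex.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <linux/futex.h>
#include <stdbool.h>
#include <time.h>

struct futex_timeout_semantics {
    bool absolute;    /* true: timeout is an absolute point in time */
    clockid_t clock;  /* clock the timeout is measured against */
};

/*
 * Fill *out for the four operations discussed in the thread.
 * Returns 0 on success, -1 for any other operation.
 */
int futex_describe_timeout(int op, struct futex_timeout_semantics *out)
{
    int cmd = op & FUTEX_CMD_MASK;  /* strip the PRIVATE and CLOCK_REALTIME flags */
    clockid_t flagged = (op & FUTEX_CLOCK_REALTIME) ? CLOCK_REALTIME
                                                    : CLOCK_MONOTONIC;

    switch (cmd) {
    case FUTEX_LOCK_PI:          /* always CLOCK_REALTIME, absolute */
        out->absolute = true;
        out->clock = CLOCK_REALTIME;
        return 0;
    case FUTEX_WAIT_REQUEUE_PI:  /* clock chosen by flag, absolute */
    case FUTEX_WAIT_BITSET:
        out->absolute = true;
        out->clock = flagged;
        return 0;
    case FUTEX_WAIT:             /* clock chosen by flag, relative */
        out->absolute = false;
        out->clock = flagged;
        return 0;
    default:
        return -1;
    }
}
```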




The flag was explicitly added to allow FUTEX_WAIT to hand in absolute time.


When you say that the "flag was added", which flag do you mean? Or, did you
mean: "applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute
time".


I didn't express myself clearly. When Darren added the support for
CLOCK_REALTIME to FUTEX_WAIT I think he wanted to add absolute timeout
support. Anything else does not make sense.


I sent that patch because reading the new man page it struck me as strange that
FUTEX_WAIT was restricted to CLOCK_MONOTONIC and the other op codes were not,
especially since FUTEX_WAIT is just FUTEX_WAIT_BITSET with the mask set to
ALL.

I didn't realize the impact to relative/absolute interpretation of the timeout
value at the time.

I think it was a mistake to introduce a change that made FUTEX_WAIT interpret
the timeout differently based on the CLOCK flag,


I'm missing something. Where does it do that? As far as I can tell FUTEX_WAIT
always interprets the timeout as relative, regardless of presence/absence of
FUTEX_CLOCK_REALTIME? Am I missing something?


No you're not. The code as it stands today is always relative, but it gets the
base time from the wrong clock source in the case of FUTEX_CLOCK_REALTIME.


Ahh yes, I'd clicked to that, but forgot to say so.


I was stating that I think it would be a mistake to add absolute timeout to
FUTEX_WAIT based on the FUTEX_CLOCK_REALTIME flag, which is how Thomas describes
above his interpretation of my earlier change.


Got it now. Thanks for the clarification, Darren.

Cheers

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-23 Thread Michael Kerrisk (man-pages)

On 06/23/2016 08:28 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 07:26:52PM +0200, Thomas Gleixner wrote:

On Thu, 23 Jun 2016, Darren Hart wrote:

On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:
In my opinion, we should treat the timeout value as relative for FUTEX_WAIT
regardless of the CLOCK used.


Which requires even more changes as you have to select which clock you are
using for adding the base time.


Right, something like the following?


diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..c39d807 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,8 +3230,12 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;

 		t = timespec_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
-			t = ktime_add_safe(ktime_get(), t);
+		if (cmd == FUTEX_WAIT) {
+			if (cmd & FUTEX_CLOCK_REALTIME)
+				t = ktime_add_safe(ktime_get_real(), t);
+			else
+				t = ktime_add_safe(ktime_get(), t);
+		}
 		tp = &t;
 	}
 	/*


Just in the interests of readability/maintainability, might it not
make some sense to recode the timeout handling for FUTEX_WAIT
within futex_wait(). I think that part of the reason we're in this
mess of inconsistency is that timeout interpretation is being handled
at too many different points in the code.
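
As a reading aid for the patch sketch above: assuming, as in the surrounding syscall code, that cmd is derived as op & FUTEX_CMD_MASK, the FUTEX_CLOCK_REALTIME bit is masked out of cmd and is only visible in op. The following user-space model illustrates that and the base-clock selection being discussed. futex_cmd() and futex_wait_base_clock() are hypothetical helpers; the constants are the real ones from <linux/futex.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <linux/futex.h>
#include <time.h>

/* The command with the PRIVATE and CLOCK_REALTIME flag bits stripped. */
int futex_cmd(int op)
{
    return op & FUTEX_CMD_MASK;
}

/*
 * Clock that would supply the base time for a relative FUTEX_WAIT
 * timeout. The flag has to be tested on op, because futex_cmd(op)
 * no longer carries it.
 */
clockid_t futex_wait_base_clock(int op)
{
    return (op & FUTEX_CLOCK_REALTIME) ? CLOCK_REALTIME : CLOCK_MONOTONIC;
}
```

Under that assumption, a test of the form (cmd & FUTEX_CLOCK_REALTIME) can never be true, so the flag check would need to use op instead.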


And as a follow-on, what is the reason for FUTEX_LOCK_PI only using
CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a
specific amount of time, regardless of wall time.


Yes, that's another weird inconsistency.

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-23 Thread Michael Kerrisk (man-pages)

Hi Darren,

On 06/23/2016 06:16 PM, Darren Hart wrote:

On Thu, Jun 23, 2016 at 03:40:36PM +0200, Thomas Gleixner wrote:

On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:

On 06/23/2016 09:18 AM, Thomas Gleixner wrote:
Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner <t...@linutronix.de> wrote:

On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:

And that universe would love to have your documentation of
FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),


I give you almost the full treatment, but I leave REQUEUE_PI to Darren
and FUTEX_WAKE_OP to Jakub. :)
[...]
FUTEX_CLOCK_REALTIME

This option bit can be ored on the futex ops FUTEX_WAIT_BITSET
and FUTEX_WAIT_REQUEUE_PI

If set the kernel treats the user space supplied timeout as
absolute time based on CLOCK_REALTIME.

If not set the kernel treats the user space supplied timeout
as relative time.

Unfortunately, I should have checked the code more carefully...


Me too :)


Seems to be going around...




Looking more carefully at the code, I understand that the situation
is the following:

FUTEX_LOCK_PI
Always uses CLOCK_REALTIME
'timeout' is absolute


Yes.


FUTEX_WAIT_REQUEUE_PI
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute


Yes


FUTEX_WAIT_BITSET
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute


Yes


FUTEX_WAIT
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is relative


Yes.


I've amended the man page to describe those details.


OK, that confirms my question, timeout interpretation as relative or absolute is
based on the op code, not the CLOCK flag.




The flag was explicitly added to allow FUTEX_WAIT to hand in absolute time.


When you say that the "flag was added", which flag do you mean? Or, did you
mean: "applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute
time".


I didn't express myself clearly. When Darren added the support for
CLOCK_REALTIME to FUTEX_WAIT I think he wanted to add absolute timeout
support. Anything else does not make sense.


I sent that patch because reading the new man page it struck me as strange that
FUTEX_WAIT was restricted to CLOCK_MONOTONIC and the other op codes were not,
especially since FUTEX_WAIT is just FUTEX_WAIT_BITSET with the mask set to
ALL.

I didn't realize the impact to relative/absolute interpretation of the timeout
value at the time.

I think it was a mistake to introduce a change that made FUTEX_WAIT interpret
the timeout differently based on the CLOCK flag,


I'm missing something. Where does it do that? As far as I can tell FUTEX_WAIT
always interprets the timeout as relative, regardless of presence/absence of
FUTEX_CLOCK_REALTIME? Am I missing something?


while that interpretation is
independent of the CLOCK flag for all other op codes.

In my opinion, we should treat the timeout value as relative for FUTEX_WAIT
regardless of the CLOCK used.


I realize it's historical, but it is really weird that FUTEX_WAIT interprets
the timeout (relative vs absolute) differently from all of the other
operations.

That would require a change to the man page to eliminate the relative/absolute
language in the FUTEX_CLOCK_REALTIME definition and explicit definitions of the
interpretation for each op code (as Matthew explains above).

Do we agree on that?


Yes.

The man page changes are already in Git. My earlier reply contained the
commit ref:
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/commit/?id=8064bfa5369c6856f606004d02e48ab275e05bed

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-23 Thread Michael Kerrisk (man-pages)

On 06/23/2016 09:18 AM, Thomas Gleixner wrote:

On Wed, 22 Jun 2016, Darren Hart wrote:

However, I don't think the patch below is correct. The existing logic
determines the type of timeout based on the futex_op when it should instead
determine the type of timeout based on the FUTEX_CLOCK_REALTIME flag.


No.


My reading of the man page is that FUTEX_WAIT_BITSET abides by the timeout
interpretation defined by the FUTEX_CLOCK_REALTIME attribute, so
SYSCALL_DEFINE6 was misbehaving for FUTEX_WAIT|FUTEX_CLOCK_REALTIME (where the
timeout should have been treated as absolute) as well as for
FUTEX_WAIT_BITSET|FUTEX_CLOCK_MONOTONIC (where the timeout should have been
treated as relative).

Consider the following:

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..fa2af29 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;

 		t = timespec_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
+		if (!(cmd & FUTEX_CLOCK_REALTIME))
 			t = ktime_add_safe(ktime_get(), t);


That breaks LOCK_PI, REQUEUE_PI and FUTEX_WAIT_BITSET


The concern for me is whether the code is incorrect, or if the man page is
incorrect. Does existing userspace code expect the FUTEX_WAIT_BITSET op to
always use an absolute timeout, regardless of the CLOCK used?


FUTEX_WAIT_BITSET, LOCK_PI and REQUEUE_PI always expect absolute time in
CLOCK_REALTIME independent of the CLOCK_REALTIME flag.


Once upon a time, you told me the following:

On 15 May 2014 at 16:14, Thomas Gleixner <t...@linutronix.de> wrote:

On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:

And that universe would love to have your documentation of
FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),


I give you almost the full treatment, but I leave REQUEUE_PI to Darren
and FUTEX_WAKE_OP to Jakub. :)
[...]
FUTEX_CLOCK_REALTIME

This option bit can be ored on the futex ops FUTEX_WAIT_BITSET
and FUTEX_WAIT_REQUEUE_PI

If set the kernel treats the user space supplied timeout as
absolute time based on CLOCK_REALTIME.

If not set the kernel treats the user space supplied timeout
as relative time.


Unfortunately, I should have checked the code more carefully...

Looking more carefully at the code, I understand that the situation
is the following:

FUTEX_LOCK_PI
Always uses CLOCK_REALTIME
'timeout' is absolute

FUTEX_WAIT_REQUEUE_PI
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute

FUTEX_WAIT_BITSET
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is absolute

FUTEX_WAIT
Choice of clock (CLOCK_REALTIME vs CLOCK_MONOTONIC) is
determined by presence or absence of
FUTEX_CLOCK_REALTIME flag
'timeout' is relative

Right?

I've amended the man page to describe those details.
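
To make the summary above easier to check against the code, here is a small user-space sketch that encodes the four cases as a lookup function. The struct, the enum, and the function name are hypothetical helpers invented for illustration; only the FUTEX_* constants come from <linux/futex.h>. It restates the table as written in this message and is not taken from the kernel.

```c
#include <linux/futex.h>
#include <stdbool.h>

/* Hypothetical illustration types, not kernel or libc definitions. */
enum demo_clock { CLK_MONOTONIC, CLK_REALTIME };

struct timeout_semantics {
    bool            takes_timeout; /* false for ops outside the summary */
    enum demo_clock clock;         /* clock the timeout is measured on */
    bool            absolute;      /* true: deadline, false: interval */
};

/* Encode the per-operation summary from the message above. */
struct timeout_semantics futex_timeout_semantics(int op)
{
    int cmd = op & FUTEX_CMD_MASK;  /* strip PRIVATE and CLOCK_REALTIME bits */
    enum demo_clock flagged =
        (op & FUTEX_CLOCK_REALTIME) ? CLK_REALTIME : CLK_MONOTONIC;
    struct timeout_semantics s = { false, CLK_MONOTONIC, false };

    switch (cmd) {
    case FUTEX_LOCK_PI:             /* always CLOCK_REALTIME, absolute */
        s = (struct timeout_semantics){ true, CLK_REALTIME, true };
        break;
    case FUTEX_WAIT_REQUEUE_PI:     /* clock chosen by flag, absolute */
    case FUTEX_WAIT_BITSET:
        s = (struct timeout_semantics){ true, flagged, true };
        break;
    case FUTEX_WAIT:                /* clock chosen by flag, relative */
        s = (struct timeout_semantics){ true, flagged, false };
        break;
    }
    return s;
}
```

For example, futex_timeout_semantics(FUTEX_WAIT_BITSET) reports an absolute timeout on the monotonic clock, which is the case the earlier man page text got wrong.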


The flag was explicitly added to allow FUTEX_WAIT to hand in absolute time.


When you say that the "flag was added", which flag do you mean? Or did you mean:
"applying Matthieu's patch will allow FUTEX_WAIT to hand in absolute time"?


diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..4bee915 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
            return -EINVAL;

    t = timespec_to_ktime(ts);
-   if (cmd == FUTEX_WAIT)
+   if (cmd == FUTEX_WAIT && !(op & FUTEX_CLOCK_REALTIME))
        t = ktime_add_safe(ktime_get(), t);
    tp = &t;
 }


So this patch is correct and if the man page is unclear about it then we need
to fix that.
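
As a rough aid for comparing the variants discussed in this thread, the sketch below models only the "is the timeout converted from relative to absolute?" decision in user space. The three function names are hypothetical. The sketch assumes, as I believe the syscall entry code does, that cmd is op with the FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME bits masked off.

```c
#include <linux/futex.h>
#include <stdbool.h>

/* Existing check: every FUTEX_WAIT timeout is treated as relative. */
bool converts_original(int op)
{
    int cmd = op & FUTEX_CMD_MASK;
    return cmd == FUTEX_WAIT;
}

/*
 * Flag-only variant (the ..fa2af29 diff). Assumption: cmd no longer
 * carries FUTEX_CLOCK_REALTIME, so the test is true for every operation.
 */
bool converts_flag_only(int op)
{
    int cmd = op & FUTEX_CMD_MASK;
    return !(cmd & FUTEX_CLOCK_REALTIME);
}

/* Matthieu's variant (the ..4bee915 diff): relative only for plain FUTEX_WAIT. */
bool converts_matthieu(int op)
{
    int cmd = op & FUTEX_CMD_MASK;
    return cmd == FUTEX_WAIT && !(op & FUTEX_CLOCK_REALTIME);
}
```

Under that assumption, the flag-only variant would also convert FUTEX_LOCK_PI, FUTEX_WAIT_REQUEUE_PI and FUTEX_WAIT_BITSET timeouts, which matches the objection that it breaks those operations, while Matthieu's variant changes behavior only for FUTEX_WAIT | FUTEX_CLOCK_REALTIME.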


So, my fixes to the man page just now are at
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/commit/?id=8064bfa5369c6856f606004d02e48ab275e05bed

If Matthieu's patch is applied, obviously a further fix will
be needed in the description of FUTEX_WAIT.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op

2016-06-23 Thread Michael Kerrisk (man-pages)

Hi Darren,

On 06/23/2016 06:48 AM, Darren Hart wrote:

On Mon, Jun 20, 2016 at 04:26:52PM +0200, Matthieu CASTET wrote:

Hi,

the commit 337f13046ff03717a9e99675284a817527440a49 says that it
changes the syscall to be equivalent to FUTEX_WAIT_BITSET |
FUTEX_CLOCK_REALTIME with a bitset of FUTEX_BITSET_MATCH_ANY.

It seems wrong to me, because in case of FUTEX_WAIT, in
"SYSCALL_DEFINE6(futex", we convert relative timeout to absolute
timeout [1].

So FUTEX_CLOCK_REALTIME | FUTEX_WAIT expects a relative timeout,
whereas FUTEX_WAIT_BITSET takes an absolute timeout.

To make it work you have to use something like the (untested) attached
patch.


+Eric Dumazet

Thanks for reporting Matthieu,

FUTEX_WAIT traditionally used a relative timeout with CLOCK_MONOTONIC while
FUTEX_WAIT_BITSET could use either ??? based on the FUTEX_CLOCK_ flag used. The
man page is not particularly clear on this:

http://man7.org/linux/man-pages/man2/futex.2.html

"
The FUTEX_WAIT_BITSET operation also interprets the timeout argument
differently from FUTEX_WAIT.  See the discussion of FUTEX_CLOCK_REALTIME,
above.
"

Michael Kerrisk:
I think this language could be removed now that we support the
FUTEX_CLOCK_REALTIME flag for both futex ops.


Done.


As for the intended behavior of the FUTEX_CLOCK_REALTIME flag:


FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
This option bit can be employed only with the FUTEX_WAIT_BITSET,
FUTEX_WAIT_REQUEUE_PI, and FUTEX_WAIT (since Linux 4.5) operations.

(NOTE: FUTEX_WAIT was recently added after the patch in question here)

If this option is set, the kernel treats timeout as an absolute time based
on CLOCK_REALTIME.

If this option is not set, the kernel treats timeout as a relative time,
measured against the CLOCK_MONOTONIC clock.


This supports your argument Matthieu. The assumption of a relative timeout for
FUTEX_WAIT in SYSCALL_DEFINE6 needs to be updated to account for FUTEX_WAIT now
honoring the FUTEX_CLOCK_REALTIME flag, which treats the timeout as absolute.
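
For callers that have to hand in an absolute timeout, the usual pattern is to read the relevant clock and add the desired interval. The helper below is a minimal sketch of that arithmetic only; the function name and the NSEC_PER_SEC macro are defined here for illustration, and both inputs are assumed to be normalized (0 <= tv_nsec < 1e9).

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Turn "now" plus a relative interval into an absolute deadline. */
struct timespec deadline_from_relative(struct timespec now,
                                       struct timespec rel)
{
    struct timespec abs = {
        .tv_sec  = now.tv_sec + rel.tv_sec,
        .tv_nsec = now.tv_nsec + rel.tv_nsec,
    };

    /* Carry at most one second, since both inputs were normalized. */
    if (abs.tv_nsec >= NSEC_PER_SEC) {
        abs.tv_sec += 1;
        abs.tv_nsec -= NSEC_PER_SEC;
    }
    return abs;
}
```

In real use, "now" would come from clock_gettime(2) on whichever clock the operation will measure the deadline against, so that the two sides agree.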

However, I don't think the patch below is correct. The existing logic
determines the type of timeout based on the futex_op when it should instead
determine the type of timeout based on the FUTEX_CLOCK_REALTIME flag.

My reading of the man page is that FUTEX_WAIT_BITSET abides by the timeout
interpretation defined by the FUTEX_CLOCK_REALTIME attribute, so
SYSCALL_DEFINE6 was misbehaving for FUTEX_WAIT|FUTEX_CLOCK_REALTIME (where the
timeout should have been treated as absolute) as well as for
FUTEX_WAIT_BITSET|FUTEX_CLOCK_MONOTONIC (where the timeout should have been
treated as relative).

Consider the following:

diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..fa2af29 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
            return -EINVAL;

    t = timespec_to_ktime(ts);
-   if (cmd == FUTEX_WAIT)
+   if (!(cmd & FUTEX_CLOCK_REALTIME))
        t = ktime_add_safe(ktime_get(), t);
    tp = &t;
 }

The concern for me is whether the code is incorrect, or if the man page is
incorrect. Does existing userspace code expect the FUTEX_WAIT_BITSET op to
always use an absolute timeout, regardless of the CLOCK used?


So, there clearly seem to be some things broken in the man page. See the
reply I sent to tglx.

Cheers,

Michael



[1]
if (cmd == FUTEX_WAIT)
        t = ktime_add_safe(ktime_get(), t);



diff --git a/kernel/futex.c b/kernel/futex.c
index 33664f7..4bee915 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3230,7 +3230,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
            return -EINVAL;

    t = timespec_to_ktime(ts);
-   if (cmd == FUTEX_WAIT)
+   if (cmd == FUTEX_WAIT && !(op & FUTEX_CLOCK_REALTIME))
        t = ktime_add_safe(ktime_get(), t);
    tp = &t;
 }






--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-23 Thread Michael Kerrisk (man-pages)

Hi Jann,

Thanks for your further review. Follow-up of one point below.

On 06/23/2016 12:44 AM, Jann Horn wrote:

On Wed, Jun 22, 2016 at 09:21:29PM +0200, Michael Kerrisk (man-pages) wrote:

On 06/21/2016 10:55 PM, Jann Horn wrote:

On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages) wrote:


[...]


  The algorithm employed for ptrace access mode checking determines
  whether the calling process is allowed to perform the
  corresponding action on the target process, as follows:

  1.  If the calling thread and the target thread are in the same
  thread group, access is always allowed.

  2.  If the access mode specifies PTRACE_MODE_FSCREDS, then  for
  the  check in the next step, employ the caller's filesystem
  user ID and group ID (see credentials(7));  otherwise  (the
  access  mode  specifies  PTRACE_MODE_REALCREDS, so) use the
  caller's real user ID and group ID.


Might want to add a "for historical reasons" or so here.


Can you be a little more precise about "here", and maybe tell me why
you think it helps?


I'm not sure, but it might be a good idea to add something like this at the
end of 2.:
"(Most other APIs that check one of the caller's UIDs use the effective one.
This API uses the real UID instead for historical reasons.)"

In my opinion, it is inconsistent to use the real UID/GID here, the
effective one would be more appropriate. But since the existing code uses
the real UID/GID and that's not a security issue for existing users of
the ptrace API, this wasn't changed when I added the REALCREDS/FSCREDS
distinction.

I think that for a reader, it might help to point out that in most cases,
when a process is the subject in an access check, its effective UID/GID
are used, and this is (together with kill()) an exception to that rule.
But you're the expert on writing documentation, if you think that that's
too much detail / confusing here, it probably is.


Okay -- got it now, I think. I made this text:

   2.  If the access mode specifies PTRACE_MODE_FSCREDS, then, for
       the check in the next step, employ the caller's filesystem
       UID and GID.  (As noted in credentials(7), the filesystem
       UID and GID almost always have the same values as the
       corresponding effective IDs.)

       Otherwise, the access mode specifies PTRACE_MODE_REALCREDS,
       so use the caller's real UID and GID for the checks in the
       next step.  (Most APIs that check the caller's UID and GID
       use the effective IDs.  For historical reasons, the
       PTRACE_MODE_REALCREDS check uses the real IDs instead.)
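
The credential selection in step 2 can be sketched in user space as follows. Everything here is illustrative: the two structs and the function name are hypothetical, and the flag values are local stand-ins that mirror my recollection of the kernel-internal header, so treat them as assumptions.

```c
#include <sys/types.h>

/* Assumed values for the kernel-internal modifiers; illustration only. */
#define PTRACE_MODE_FSCREDS   0x08
#define PTRACE_MODE_REALCREDS 0x10

/* Hypothetical subset of a caller's credentials. */
struct caller_creds {
    uid_t ruid, fsuid;
    gid_t rgid, fsgid;
};

/* The UID/GID pair that the next step of the check would compare. */
struct id_pair {
    uid_t uid;
    gid_t gid;
};

/* Step 2: filesystem IDs for FSCREDS, otherwise the real IDs. */
struct id_pair ids_for_check(unsigned int mode, const struct caller_creds *c)
{
    struct id_pair p;

    if (mode & PTRACE_MODE_FSCREDS) {
        p.uid = c->fsuid;
        p.gid = c->fsgid;
    } else {            /* PTRACE_MODE_REALCREDS */
        p.uid = c->ruid;
        p.gid = c->rgid;
    }
    return p;
}
```

A set-user-ID program is the case where the two branches differ: its real IDs stay those of the invoking user while its filesystem IDs follow the effective IDs.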

[...]

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-23 Thread Michael Kerrisk (man-pages)

Hi Oleg,

On 06/22/2016 11:51 PM, Oleg Nesterov wrote:

On 06/21, Eric W. Biederman wrote:


Adding Oleg just because he seems to do most of the ptrace related
maintenance these days.


so I have to admit that I never even tried to actually understand
ptrace_may_access ;)


We certainly need something that gives a high level view so people
reading the man page can know what to expect.   If you get down into the
weeds we run the danger of people beginning to think they can depend
upon bugs in the implementation.


Personally I agree. I think "man ptrace" shouldn't tell too much
about kernel internals.


See my other replies on this topic. Somehow, we need a way of
describing the behavior that user-space sees. I think it's
inevitable that that means talking about what's going on
"under the hood".

Regarding Eric's point that "we run the danger of people beginning
to think they can depend upon bugs in the implementation": when it
comes to breaking the ABI, the presence or absence of documentation
doesn't save us on that point (Linus has a few times made his position
wrt documentation clear).

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-23 Thread Michael Kerrisk (man-pages)

On 06/22/2016 11:11 PM, Kees Cook wrote:

On Wed, Jun 22, 2016 at 12:21 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:

On 06/21/2016 10:55 PM, Jann Horn wrote:

On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages)
wrote:

   5.  The  kernel LSM security_ptrace_access_check() interface is
   invoked to see if ptrace access is permitted.  The  results
   depend on the LSM.  The implementation of this interface in
   the default LSM performs the following steps:



For people who are unaware of how the LSM API works, it might be good to
clarify that the commoncap LSM is *always* invoked; otherwise, it might
give the impression that using another LSM would replace it.



As we can see, I am one of those who are unaware of how the LSM API
works :-/.


(Also, are there other documents that refer to it as "default LSM"? I
think that that term is slightly confusing.)



No, that's a terminological confusion of my own making. Fixed now.

I changed this text to:

   Various parts of the kernel-user-space API (not just ptrace(2)
   operations) require so-called "ptrace access mode permissions",
   which are gated by any enabled Linux Security Module (LSM), for
   example SELinux, Yama, or Smack, and by the commoncap LSM
   (which is always invoked).  Prior to Linux 2.6.27, all such
   checks were of a single type.  Since Linux 2.6.27, two access
   mode levels are distinguished:

BTW, can you point me at the piece(s) of kernel code that show that
"commoncap" is always invoked in addition to any other LSM that has
been installed?


It's not entirely obvious, but the bottom of security/commoncap.c shows:


Thanks Kees!

Cheers,

Michael



#ifdef CONFIG_SECURITY

struct security_hook_list capability_hooks[] = {
        LSM_HOOK_INIT(capable, cap_capable),
        ...
};

void __init capability_add_hooks(void)
{
        security_add_hooks(capability_hooks, ARRAY_SIZE(capability_hooks));
}

#endif

And security/security.c shows the initialization order of the LSMs:

int __init security_init(void)
{
        pr_info("Security Framework initialized\n");

        /*
         * Load minor LSMs, with the capability module always first.
         */
        capability_add_hooks();
        yama_add_hooks();
        loadpin_add_hooks();

        /*
         * Load all the remaining security modules.
         */
        do_security_initcalls();

        return 0;
}


-Kees
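
To illustrate why registration order means commoncap is consulted in addition to any other LSM, here is a toy user-space model of a stacked hook list. It is not the kernel's implementation: the typedef, the demo hooks, and call_hooks() are all hypothetical, and the "first nonzero result denies" rule is my understanding of how integer-returning hooks are combined.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook signature: 0 allows, a negative errno denies. */
typedef int (*access_hook)(unsigned int mode);

int demo_commoncap_calls;   /* counts invocations of the first hook */

/* Stand-in for the capability module's check; always allows here. */
int demo_commoncap_hook(unsigned int mode)
{
    (void)mode;
    demo_commoncap_calls++;
    return 0;
}

/* Stand-in for some other LSM that refuses the access. */
int demo_denying_hook(unsigned int mode)
{
    (void)mode;
    return -EPERM;
}

/* Walk the hooks in registration order; the first denial wins. */
int call_hooks(const access_hook *hooks, size_t n, unsigned int mode)
{
    for (size_t i = 0; i < n; i++) {
        int rc = hooks[i](mode);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

With the list { demo_commoncap_hook, demo_denying_hook }, the first hook still runs before the second one denies, which is the point made above: adding another LSM extends the list rather than replacing the capability checks.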






--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Documenting ptrace access mode checking

2016-06-22 Thread Michael Kerrisk (man-pages)

Hi Jann,


On 06/21/2016 10:55 PM, Jann Horn wrote:

On Tue, Jun 21, 2016 at 11:41:16AM +0200, Michael Kerrisk (man-pages) wrote:

Hi Jann, Stephen, et al.

Jann, since you recently committed a patch in this area, and Stephen,
since you committed 006ebb40d3d much further back in time, I wonder if
you might help me by reviewing the text below that I propose to add to
the ptrace(2) man page, in order to document "ptrace access mode
checking" that is performed in various parts of the kernel-user-space
interface. Of course, I welcome input from anyone else as well.

Here's the new ptrace(2) text. Any comments, technical or terminological
fixes, other improvements, etc. are welcome.


As others have said, I'm surprised about seeing documentation about
kernel-internal constants in manpages - but I think it might be a good
thing to have there, given that people who look at ptrace(2) are likely
to be interested in low-level details.


I agree that it is a little surprising to add kernel-internal
constants in a man page. (There are precedents, but they are few.)
But see my reply to Kees. It's more than just explaining low level
details: there are various kinds of user-space behavior differences
(real vs filesystem credentials; permitted vs effective capabilities)
produced by the ptrace_may_access() checks, and those behaviors need
to be described and *somehow* labeled for cross-referencing from
other man pages.


[[
   Ptrace access mode checking
   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations), require so-called "ptrace access mode permissions"
   which are gated  by  Linux  Security  Modules  (LSMs)  such  as
   SELinux,  Yama,  Smack,  or  the  default  LSM.  Prior to Linux
   2.6.27, all such checks were of a  single  type.   Since  Linux
   2.6.27, two access mode levels are distinguished:

   PTRACE_MODE_READ
  For "read" operations or other operations that are less
  dangerous, such as: get_robust_list(2); kcmp(2); reading
  /proc/[pid]/auxv, /proc/[pid]/environ, or
  /proc/[pid]/stat; or readlink(2) of a /proc/[pid]/ns/*
  file.

   PTRACE_MODE_ATTACH
  For "write" operations, or other operations that are
  more dangerous, such as: ptrace attaching
  (PTRACE_ATTACH) to another process or calling
  process_vm_writev(2).  (PTRACE_MODE_ATTACH was effectively
  the default before Linux 2.6.27.)

   Since  Linux  4.5, the above access mode checks may be combined


s/may/must/; otherwise __ptrace_may_access() will yell about the kernel
code being broken and deny access.


Good point. I changed "may" to "are". ("must" is not quite right to my
"user-space" ear; it might be misread as implying that the user-space
developer must do something.)


   (ORed) with one of the following modifiers:

   PTRACE_MODE_FSCREDS
  Use the caller's filesystem UID and GID (see
  credentials(7)) or effective capabilities for LSM checks.

   PTRACE_MODE_REALCREDS
  Use the caller's real UID and GID or permitted capabilities
  for LSM checks.  This was effectively the default
  before Linux 4.5.

   Because  combining  one of the credential modifiers with one of
   the aforementioned access modes is  typical,  some  macros  are
   defined in the kernel sources for the combinations:

   PTRACE_MODE_READ_FSCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_READ_REALCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

   PTRACE_MODE_ATTACH_FSCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_ATTACH_REALCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.
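
The combinations above, and the point that exactly one credential modifier has to accompany the access mode, can be written out as a short sketch. The numeric values are local stand-ins that mirror my recollection of the kernel-internal header and should be read as assumptions; creds_modifier_is_valid() is a hypothetical name for the sanity check described in the review comment.

```c
#include <stdbool.h>

/* Assumed values for the kernel-internal flags; illustration only. */
#define PTRACE_MODE_READ      0x01
#define PTRACE_MODE_ATTACH    0x02
#define PTRACE_MODE_NOAUDIT   0x04
#define PTRACE_MODE_FSCREDS   0x08
#define PTRACE_MODE_REALCREDS 0x10

/* Shorthand combinations, as listed in the proposed man page text. */
#define PTRACE_MODE_READ_FSCREDS     (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_READ_REALCREDS   (PTRACE_MODE_READ | PTRACE_MODE_REALCREDS)
#define PTRACE_MODE_ATTACH_FSCREDS   (PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_ATTACH_REALCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS)

/* True when exactly one of FSCREDS and REALCREDS is present. */
bool creds_modifier_is_valid(unsigned int mode)
{
    bool fs   = (mode & PTRACE_MODE_FSCREDS) != 0;
    bool real = (mode & PTRACE_MODE_REALCREDS) != 0;

    return fs != real;
}
```

A mode with neither modifier, or with both, fails this check; that is the situation in which the kernel-side check is said to warn and deny access.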

   One further modifier can be ORed with the access mode:

   PTRACE_MODE_NOAUDIT (since Linux 3.3)
  Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]

   The algorithm employed for ptrace access mode checking determines
   whether the calling process is allowed to perform the
   corresponding action on the target process, as follows:

   1.  If the calling thread and the target thread are in the same
   thread group, access is always allowed.

   2.  If the access mode specifies PTRACE_MODE_FSCREDS, then  for
   the  check in the next step, employ the caller's filesystem
   user ID and group ID (see credentials(7));  otherwise  (the
   access  mode  specifies  PTRACE_MODE_REALCREDS, so) use the
   caller's real user ID and group ID.


Might want to add a "for historical reasons" or so here.


Re: Documenting ptrace access mode checking

2016-06-22 Thread Michael Kerrisk (man-pages)

Hi Kees,

On 06/21/2016 10:29 PM, Kees Cook wrote:

On Tue, Jun 21, 2016 at 12:55 PM, Eric W. Biederman
<ebied...@xmission.com> wrote:


Adding Oleg just because he seems to do most of the ptrace related
maintenance these days.

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


Hi Jann, Stephen, et al.

Jann, since you recently committed a patch in this area, and Stephen,
since you committed 006ebb40d3d much further back in time, I wonder if
you might help me by reviewing the text below that I propose to add to
the ptrace(2) man page, in order to document "ptrace access mode
checking" that is performed in various parts of the kernel-user-space
interface. Of course, I welcome input from anyone else as well.


Your text matches my understanding of this code. :)


Thanks for reviewing the text!


Here's the new ptrace(2) text. Any comments, technical or terminological
fixes, other improvements, etc. are welcome.

[[
   Ptrace access mode checking
   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations), require so-called "ptrace access mode permissions"
   which are gated  by  Linux  Security  Modules  (LSMs)  such  as
   SELinux,  Yama,  Smack,  or  the  default  LSM.  Prior to Linux
   2.6.27, all such checks were of a  single  type.   Since  Linux
   2.6.27, two access mode levels are distinguished:

   PTRACE_MODE_READ
  For  "read" operations or other operations that are less
  dangerous, such as: get_robust_list(2); kcmp(2); reading
  /proc/[pid]/auxv, /proc/[pid]/environ, or
  /proc/[pid]/stat; or readlink(2) of  a  /proc/[pid]/ns/*
  file.

   PTRACE_MODE_ATTACH
  For  "write"  operations,  or  other operations that are
  more dangerous, such as: ptrace attaching
  (PTRACE_ATTACH) to another process or calling
  process_vm_writev(2).   (PTRACE_MODE_ATTACH  was  effec‐
  tively the default before Linux 2.6.27.)

   Since  Linux  4.5, the above access mode checks may be combined
   (ORed) with one of the following modifiers:

   PTRACE_MODE_FSCREDS
  Use the caller's filesystem UID  and  GID  (see  creden‐
  tials(7)) or effective capabilities for LSM checks.

   PTRACE_MODE_REALCREDS
  Use the caller's real UID and GID or permitted capabili‐
  ties for LSM checks.  This was effectively  the  default
  before Linux 4.5.
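
To see the two sets of credentials these modifiers choose between, a
process can inspect its own IDs. The sketch below parses the "Uid:" and
"Gid:" lines of /proc/[pid]/status, whose four fields are documented in
proc(5) as real, effective, saved set, and filesystem IDs. The names
Ids, parse_ids(), and read_own_ids() are hypothetical and exist only for
this example.

```python
from typing import NamedTuple


class Ids(NamedTuple):
    real: int
    effective: int
    saved: int
    filesystem: int


def parse_ids(status_text: str, field: str = "Uid") -> Ids:
    """Parse the 'Uid:' or 'Gid:' line from /proc/[pid]/status text."""
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key == field:
            return Ids(*(int(value) for value in rest.split()))
    raise ValueError(f"no {field}: line found")


def read_own_ids(field: str = "Uid") -> Ids:
    """Read the calling process's own IDs (Linux only)."""
    with open("/proc/self/status") as f:
        return parse_ids(f.read(), field)
```

With PTRACE_MODE_FSCREDS the check would use the "filesystem" field;
with PTRACE_MODE_REALCREDS it would use the "real" field.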

   Because  combining  one of the credential modifiers with one of
   the aforementioned access modes is  typical,  some  macros  are
   defined in the kernel sources for the combinations:

   PTRACE_MODE_READ_FSCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_READ_REALCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

   PTRACE_MODE_ATTACH_FSCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_ATTACH_REALCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

   One further modifier can be ORed with the access mode:

   PTRACE_MODE_NOAUDIT (since Linux 3.3)
  Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]


 AKA don't let the audit subsystem know, which tends to
 generate audit records when capable() is called.


   The  algorithm  employed for ptrace access mode checking deter‐
   mines whether the calling process is  allowed  to  perform  the
   corresponding action on the target process, as follows:

   1.  If the calling thread and the target thread are in the same
   thread group, access is always allowed.


This test only exists because the LSMs historically were, and I suspect
continue to be, broken and deny a process the ability to ptrace itself.


Well, it's not that the LSMs are broken, it's that self-inspection is
a short-circuited "allow". The LSMs aren't involved.


   2.  If the access mode specifies PTRACE_MODE_FSCREDS, then  for
   the  check in the next step, employ the caller's filesystem
   user ID and group ID (see credentials(7));  otherwise  (the
   access  mode  specifies  PTRACE_MODE_REALCREDS, so) use the
   caller's real user ID and group ID.

   3.  Deny access if neither of the following is true:

   · The real, effective, and saved-set user IDs of the target
 match  the caller's user ID, and the real, effective, and
 saved-set group IDs of  the  target  match  the  caller's
 group ID.

   · The caller has the CAP_SYS_PTRACE capability.

   4.  Deny  access if the target process "dumpable" attribute has
   a valu

Re: Documenting ptrace access mode checking

2016-06-22 Thread Michael Kerrisk (man-pages)

Hi Eric,

On 06/21/2016 09:55 PM, Eric W. Biederman wrote:


Adding Oleg just because he seems to do most of the ptrace related
maintenance these days.

"Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:


Hi Jann, Stephen, et al.

Jann, since you recently committed a patch in this area, and Stephen,
since you committed 006ebb40d3d much further back in time, I wonder if
you might help me by reviewing the text below that I propose to add to
the ptrace(2) man page, in order to document "ptrace access mode
checking" that is performed in various parts of the kernel-user-space
interface. Of course, I welcome input from anyone else as well.

Here's the new ptrace(2) text. Any comments, technical or terminological
fixes, other improvements, etc. are welcome.

[[
   Ptrace access mode checking
   Various parts of the kernel-user-space API (not just  ptrace(2)
   operations), require so-called "ptrace access mode permissions"
   which are gated  by  Linux  Security  Modules  (LSMs)  such  as
   SELinux,  Yama,  Smack,  or  the  default  LSM.  Prior to Linux
   2.6.27, all such checks were of a  single  type.   Since  Linux
   2.6.27, two access mode levels are distinguished:

   PTRACE_MODE_READ
  For  "read" operations or other operations that are less
  dangerous, such as: get_robust_list(2); kcmp(2); reading
  /proc/[pid]/auxv, /proc/[pid]/environ, or
  /proc/[pid]/stat; or readlink(2) of  a  /proc/[pid]/ns/*
  file.

   PTRACE_MODE_ATTACH
  For  "write"  operations,  or  other operations that are
  more dangerous, such as: ptrace attaching
  (PTRACE_ATTACH) to another process or calling
  process_vm_writev(2).   (PTRACE_MODE_ATTACH  was  effec‐
  tively the default before Linux 2.6.27.)

   Since  Linux  4.5, the above access mode checks may be combined
   (ORed) with one of the following modifiers:

   PTRACE_MODE_FSCREDS
  Use the caller's filesystem UID  and  GID  (see  creden‐
  tials(7)) or effective capabilities for LSM checks.

   PTRACE_MODE_REALCREDS
  Use the caller's real UID and GID or permitted capabili‐
  ties for LSM checks.  This was effectively  the  default
  before Linux 4.5.

   Because  combining  one of the credential modifiers with one of
   the aforementioned access modes is  typical,  some  macros  are
   defined in the kernel sources for the combinations:

   PTRACE_MODE_READ_FSCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_READ_REALCREDS
  Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

   PTRACE_MODE_ATTACH_FSCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

   PTRACE_MODE_ATTACH_REALCREDS
  Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

   One further modifier can be ORed with the access mode:

   PTRACE_MODE_NOAUDIT (since Linux 3.3)
  Don't audit this access mode check.

[I'd quite welcome some text to explain "auditing" here.]


 AKA don't let the audit subsystem know, which tends to
 generate audit records when capable() is called.


   The  algorithm  employed for ptrace access mode checking deter‐
   mines whether the calling process is  allowed  to  perform  the
   corresponding action on the target process, as follows:

   1.  If the calling thread and the target thread are in the same
   thread group, access is always allowed.


This test only exists because the LSMs historically were, and I suspect
continue to be, broken and deny a process the ability to ptrace itself.


   2.  If the access mode specifies PTRACE_MODE_FSCREDS, then  for
   the  check in the next step, employ the caller's filesystem
   user ID and group ID (see credentials(7));  otherwise  (the
   access  mode  specifies  PTRACE_MODE_REALCREDS, so) use the
   caller's real user ID and group ID.

   3.  Deny access if neither of the following is true:

   · The real, effective, and saved-set user IDs of the target
 match  the caller's user ID, and the real, effective, and
 saved-set group IDs of  the  target  match  the  caller's
 group ID.

   · The caller has the CAP_SYS_PTRACE capability.

   4.  Deny  access if the target process "dumpable" attribute has
   a value other than 1 (SUID_DUMP_USER; see the discussion of
   PR_SET_DUMPABLE  in prctl(2)), and the caller does not have
   the CAP_SYS_PTRACE capability in the user namespace of  the
   target process.

   5.  The  kernel LSM security_ptrace_access_check() interface is
   invoked to s

Documenting ptrace access mode checking

2016-06-21 Thread Michael Kerrisk (man-pages)
  b) Deny access if neither of the following is true:

  · The caller's capabilities are a superset of the
target process's permitted capabilities.

  · The  caller  has  the CAP_SYS_PTRACE capability in the
target process's user namespace.

  Note that the default LSM does not  distinguish  between
  PTRACE_MODE_READ and PTRACE_MODE_ATTACH.

   6.  If  access  has  not  been  denied  by any of the preceding
   steps, then access is allowed.
]]
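
The steps of the draft algorithm can be modeled in user space roughly
as follows. This is a toy sketch, not the kernel's implementation: the
Task class and all of its fields are hypothetical, user namespaces are
collapsed into a single namespace, and the capability set consulted is
picked from the credential modifier as an assumption. The first half of
step 5 is cut off in this archive, so only part (b) of the default LSM
check is modeled, with "superset" read as non-strict.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

PTRACE_MODE_FSCREDS = 0x08    # illustrative flag values
PTRACE_MODE_REALCREDS = 0x10
SUID_DUMP_USER = 1
CAP_SYS_PTRACE = "CAP_SYS_PTRACE"


@dataclass
class Task:
    """Toy stand-in for a process; every field is hypothetical."""
    tgid: int
    ruid: int
    euid: int
    suid: int
    fsuid: int
    rgid: int
    egid: int
    sgid: int
    fsgid: int
    permitted: FrozenSet[str] = frozenset()
    effective: FrozenSet[str] = frozenset()
    dumpable: int = SUID_DUMP_USER


def default_lsm_check(caller_caps: FrozenSet[str], target: Task) -> bool:
    """Step 5(b): allow on capability superset or CAP_SYS_PTRACE."""
    return caller_caps >= target.permitted or CAP_SYS_PTRACE in caller_caps


def may_access(caller: Task, target: Task, mode: int,
               lsm_check: Callable[[FrozenSet[str], Task], bool]
               = default_lsm_check) -> bool:
    # Step 1: same thread group is always allowed.
    if caller.tgid == target.tgid:
        return True
    # Step 2: pick filesystem or real credentials.
    if mode & PTRACE_MODE_FSCREDS:
        uid, gid, caps = caller.fsuid, caller.fsgid, caller.effective
    else:
        uid, gid, caps = caller.ruid, caller.rgid, caller.permitted
    # Step 3: all target IDs must match, unless CAP_SYS_PTRACE is held.
    ids_match = (uid == target.ruid == target.euid == target.suid
                 and gid == target.rgid == target.egid == target.sgid)
    if not ids_match and CAP_SYS_PTRACE not in caps:
        return False
    # Step 4: a non-dumpable target requires CAP_SYS_PTRACE.
    if target.dumpable != SUID_DUMP_USER and CAP_SYS_PTRACE not in caps:
        return False
    # Step 5: consult the LSM hook (only part (b) is modeled here).
    if not lsm_check(caps, target):
        return False
    # Step 6: nothing denied the request.
    return True
```

Passing the LSM step as a callable keeps the model honest about the
fact that the outcome of step 5 depends on which LSMs are loaded.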

There are accompanying changes to various pages that refer to 
the new text in ptrace(2), so that, for example, kcmp(2) adds:

   Permission  to  employ kcmp() is governed by ptrace access mode
   PTRACE_MODE_READ_REALCREDS checks against both pid1 and pid2;
   see ptrace(2).

and proc.5 has additions such as:

   /proc/[pid]/auxv (since 2.6.0-test7)
  ...
  Permission to access this file is governed by  a  ptrace
  access mode PTRACE_MODE_READ_FSCREDS check; see
  ptrace(2).

   /proc/[pid]/cwd
  ...
  Permission to dereference  or  read  (readlink(2))  this
  symbolic  link  is  governed  by  a  ptrace  access mode
  PTRACE_MODE_READ_FSCREDS check; see ptrace(2).
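
From user space, a failed check of this kind typically shows up as
EACCES or EPERM on the /proc operation. A minimal sketch, assuming a
Linux system with /proc mounted; probe_cwd() is a hypothetical helper
name, and a denial may also come from causes other than the ptrace
access mode check:

```python
import os
from typing import Optional


def probe_cwd(pid: str = "self") -> Optional[str]:
    """Return the target's current working directory, or None if the
    kernel refuses the readlink with EACCES or EPERM."""
    try:
        return os.readlink(f"/proc/{pid}/cwd")
    except PermissionError:
        return None
```

Calling probe_cwd() with no argument inspects the caller itself, which
the same-thread-group rule always permits. Passing the PID of a process
owned by another user would normally return None for an unprivileged
caller.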

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 0/9] [v2] System Calls for Memory Protection Keys

2016-06-08 Thread Michael Kerrisk (man-pages)
On 06/07/2016 10:47 PM, Dave Hansen wrote:
> Are there any concerns with merging these into the x86 tree so
> that they go upstream for 4.8?  

I believe we still don't have up-to-date man pages, right?
Best from my POV to send them out in parallel with the 
implementation.

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 5/8] x86, pkeys: allocation/free syscalls

2016-06-03 Thread Michael Kerrisk (man-pages)
On 06/03/2016 12:28 PM, Dave Hansen wrote:
> On 06/02/2016 05:26 PM, Michael Kerrisk (man-pages) wrote:
>> On 06/01/2016 07:17 PM, Dave Hansen wrote:
>>> On 06/01/2016 05:11 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>>>
>>>>>>>> If I read this right, it doesn't actually remove any pkey restrictions
>>>>>>>> that may have been applied while the key was allocated.  So there could be
>>>>>>>> pages with that key assigned that might do surprising things if the key is
>>>>>>>> reallocated for another use later, right?  Is that how the API is intended
>>>>>>>> to work?
>>>>>>
>>>>>> Yeah, that's how it works.
>>>>>>
>>>>>> It's not ideal.  It would be _best_ if, during mm_pkey_free(), we
>>>>>> ensured that no VMAs under that mm have that vma_pkey() set.  But, that
>>>>>> search would be potentially expensive (a walk over all VMAs), or would
>>>>>> force us to keep a data structure with a count of all the VMAs with a
>>>>>> given key.
>>>>>>
>>>>>> I should probably discuss this behavior in the manpages and address it
>>>> s/probably//
>>>>
>>>> And, did I miss it? Was there an updated man-pages patch in the latest
>>>> series? I did not notice it.
>>>
>>> There have been no changes to the patches that warranted updating the
>>> manpages until now.  I'll send the update immediately.
>>
>> Do those updated pages include discussion of the point noted above?
>> I could not see it mentioned there.
> 
> I added the following text to pkey_alloc.2.  I somehow neglected to send
> it out in the v3 update of the manpages RFC:
> 
> An application should not call
> .BR pkey_free ()
> on any protection key which has been assigned to an address
> range by
> .BR pkey_mprotect ()
> and which is still in use.  The behavior in this case is
> undefined and may result in an error.
> 
> I'll add that in the version (v4) I send out shortly.
> 
>> Just by the way, the above behavior seems to offer possibilities
>> for users to shoot themselves in the foot, in a way that has security
>> implications. (Or do I misunderstand?)
> 
> Protection keys have the potential to add a layer of security and
> reliability to applications.  But, they were not primarily designed as
> a security feature.  For instance, WRPKRU is a completely unprivileged
> instruction, so pkeys are useless in any case that an attacker controls
> the PKRU register or can execute arbitrary instructions.
> 
> That said, this mechanism does, indeed, allow a user to shoot themselves
> in the foot and in a way that could have security implications.
> 
> For instance, say the following happened:
> 1. A sensitive bit of data in memory was marked with a pkey
> 2. That pkey was set as PKEY_DISABLE_ACCESS
> 3. The application called pkey_free() on the pkey, without freeing
>    the sensitive data
> 4. Application calls pkey_alloc() and then clears PKEY_DISABLE_ACCESS
> 5. Application can now read the sensitive data
> 
> The application has to have basically "leaked" a reference to the pkey.
>  It forgot that it had sensitive data marked with that key.
> 
> The kernel _could_ enforce that no in-use pkey may have pkey_free()
> called on it.  But, doing that has tradeoffs which could make
> pkey_free() extremely slow:
> 
>> It's not ideal.  It would be _best_ if, during mm_pkey_free(), we
>> ensured that no VMAs under that mm have that vma_pkey() set.  But, that
>> search would be potentially expensive (a walk over all VMAs), or would
>> force us to keep a data structure with a count of all the VMAs with a
>> given key.
> 
> In addition, that checking _could_ be implemented in an application by
> inspecting /proc/$pid/smaps for "ProtectionKey: $foo" before calling
> pkey_free($foo).
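
The application-side check suggested above might look roughly like the
following sketch. It assumes the smaps format carries one
"ProtectionKey: N" line per mapping, as described in the quoted text;
pkeys_in_use() and safe_to_free() are hypothetical helper names. The
check is inherently racy if other threads can call pkey_mprotect()
between the scan and the free.

```python
import re
from typing import Set

# One "ProtectionKey: N" line is expected per mapping in smaps.
_PKEY_LINE = re.compile(r"^ProtectionKey:[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def pkeys_in_use(smaps_text: str) -> Set[int]:
    """Collect every protection key that tags at least one mapping."""
    return {int(key) for key in _PKEY_LINE.findall(smaps_text)}


def safe_to_free(pkey: int, smaps_path: str = "/proc/self/smaps") -> bool:
    """True if no mapping in the given smaps file is tagged with pkey."""
    with open(smaps_path) as f:
        return pkey not in pkeys_in_use(f.read())
```

An application would call safe_to_free(key) immediately before
pkey_free(key) and refuse to free the key when it returns False.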

So, I think all of the above needs to be made abundantly clear in 
pkeys(7).
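
For readers who want to see the failure sequence concretely, here is a
toy model of it. It is only a sketch: ToyPkeys is a hypothetical class
that imitates the behavior described in this thread (pkey_free() leaves
page tags in place, and a later pkey_alloc() may hand the same key out
again). It does not call the real system calls, and the lowest-free-key
allocation order is an assumption made for the example.

```python
PKEY_DISABLE_ACCESS = 0x1  # access-rights bit, as in the proposed API


class ToyPkeys:
    """Toy model of one process's pkey state (not kernel structures)."""

    def __init__(self, nkeys: int = 16):
        self.nkeys = nkeys
        self.allocated = {0}   # key 0 is the default key
        self.rights = {}       # pkey -> rights bits (stand-in for PKRU)
        self.page_key = {}     # page name -> pkey tag

    def pkey_alloc(self, init_rights: int = 0) -> int:
        # Assumption: hand out the lowest unallocated key.
        for key in range(1, self.nkeys):
            if key not in self.allocated:
                self.allocated.add(key)
                self.rights[key] = init_rights
                return key
        raise OSError("no free protection keys")

    def pkey_mprotect(self, page: str, key: int) -> None:
        self.page_key[page] = key

    def pkey_free(self, key: int) -> None:
        # As discussed: pages tagged with the key keep their tag.
        self.allocated.discard(key)

    def can_read(self, page: str) -> bool:
        key = self.page_key.get(page, 0)
        return not (self.rights.get(key, 0) & PKEY_DISABLE_ACCESS)
```

Allocating a key with PKEY_DISABLE_ACCESS, tagging a page, freeing the
key, and allocating again with no restrictions leaves the old page
readable in this model, which is the foot-gun the thread describes.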

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 5/8] x86, pkeys: allocation/free syscalls

2016-06-03 Thread Michael Kerrisk (man-pages)
On 06/03/2016 12:28 PM, Dave Hansen wrote:
> On 06/02/2016 05:26 PM, Michael Kerrisk (man-pages) wrote:
>> On 06/01/2016 07:17 PM, Dave Hansen wrote:
>>> On 06/01/2016 05:11 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>>>
>>>>>>>> If I read this right, it doesn't actually remove any pkey restrictions
>>>>>>>> that may have been applied while the key was allocated.  So there 
>>>>>>>> could be
>>>>>>>> pages with that key assigned that might do surprising things if the 
>>>>>>>> key is
>>>>>>>> reallocated for another use later, right?  Is that how the API is 
>>>>>>>> intended
>>>>>>>> to work?
>>>>>>
>>>>>> Yeah, that's how it works.
>>>>>>
>>>>>> It's not ideal.  It would be _best_ if we during mm_pkey_free(), we
>>>>>> ensured that no VMAs under that mm have that vma_pkey() set.  But, that
>>>>>> search would be potentially expensive (a walk over all VMAs), or would
>>>>>> force us to keep a data structure with a count of all the VMAs with a
>>>>>> given key.
>>>>>>
>>>>>> I should probably discuss this behavior in the manpages and address it
>>>> s/probably//
>>>>
>>>> And, did I miss it. Was there an updated man-pages patch in the latest
>>>> series? I did not notice it.
>>>
>>> There have been to changes to the patches that warranted updating the
>>> manpages until now.  I'll send the update immediately.
>>
>> Do those updated pages include discussion of the point noted above?
>> I could not see it mentioned there.
> 
> I added the following text to pkey_alloc.2.  I somehow neglected to send
> it out in the v3 update of the manpages RFC:
> 
> An application should not call
> .BR pkey_free ()
> on any protection key which has been assigned to an address
> range by
> .BR pkey_mprotect ()
> and which is still in use.  The behavior in this case is
> undefined and may result in an error.
> 
> I'll add that in the version (v4) I send out shortly.
> 
>> Just by the way, the above behavior seems to offer possibilities
>> for users to shoot themselves in the foot, in a way that has security
>> implications. (Or do I misunderstand?)
> 
> Protection keys have the potential to add a layer of security and
> reliability to applications.  But, they have not been primarily designed as
> a security feature.  For instance, WRPKRU is a completely unprivileged
> instruction, so pkeys are useless in any case that an attacker controls
> the PKRU register or can execute arbitrary instructions.
> 
> That said, this mechanism does, indeed, allow a user to shoot themselves
> in the foot and in a way that could have security implications.
> 
> For instance, say the following happened:
> 1. A sensitive bit of data in memory was marked with a pkey
> 2. That pkey was set as PKEY_DISABLE_ACCESS
> 3. The application called pkey_free() on the pkey, without freeing
>    the sensitive data
> 4. Application calls pkey_alloc() and then clears PKEY_DISABLE_ACCESS
> 5. Application can now read the sensitive data
> 
> The application has to have basically "leaked" a reference to the pkey.
>  It forgot that it had sensitive data marked with that key.
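
The five steps above can be modelled in a few lines of shell. This is a toy sketch, not kernel code: `alloc_map`, `vma_pkeys`, `pkey_alloc_model` and `pkey_free_model` are hypothetical names, and the 16-key limit and lowest-free-key allocation are assumptions about the x86 implementation. It only illustrates that freeing a key clears its allocation bit while the association between the key and a memory region survives.

```shell
#!/bin/sh
# Toy model of pkey allocation; none of this is kernel code.
alloc_map=1    # bit N set => pkey N is allocated; pkey 0 is always allocated
vma_pkeys=""   # "region:pkey" pair standing in for a VMA tagged by pkey_mprotect()

# Allocate the lowest free key in 1..15 (assumed policy) and leave it in $pkey.
pkey_alloc_model() {
    pkey=1
    while [ "$pkey" -le 15 ]; do
        if [ $(( (alloc_map >> pkey) & 1 )) -eq 0 ]; then
            alloc_map=$(( alloc_map | (1 << pkey) ))
            return 0
        fi
        pkey=$(( pkey + 1 ))
    done
    return 1
}

# Reject key 0, out-of-range keys and unallocated keys, then clear the bit.
# vma_pkeys is deliberately left untouched, as discussed in the thread.
pkey_free_model() {
    { [ "$1" -ge 1 ] && [ "$1" -le 15 ]; } || return 1
    [ $(( (alloc_map >> $1) & 1 )) -eq 1 ] || return 1
    alloc_map=$(( alloc_map & ~(1 << $1) ))
}

pkey_alloc_model                 # steps 1-2: the key guards a region
vma_pkeys="secret:$pkey"
pkey_free_model "$pkey"          # step 3: freed while still in use
pkey_alloc_model                 # step 4: the same key number comes back
echo "reallocated pkey $pkey; region still tagged: $vma_pkeys"
```

In this model the second allocation hands back the key number that still tags the "secret" region, which is the stale association behind step 5.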
> 
> The kernel _could_ enforce that no in-use pkey may have pkey_free()
> called on it.  But, doing that has tradeoffs which could make
> pkey_free() extremely slow:
> 
>> It's not ideal.  It would be _best_ if we during mm_pkey_free(), we
>> ensured that no VMAs under that mm have that vma_pkey() set.  But, that
>> search would be potentially expensive (a walk over all VMAs), or would
>> force us to keep a data structure with a count of all the VMAs with a
>> given key.
> 
> In addition, that checking _could_ be implemented in an application by
> inspecting /proc/$pid/smaps for "ProtectionKey: $foo" before calling
> pkey_free($foo).

So, I think all of the above needs to be made abundantly clear in 
pkeys(7).
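
The user-space check Dave mentions (looking for "ProtectionKey: $foo" in smaps before calling pkey_free($foo)) might look roughly like the sketch below. The function name `pkey_in_use` is hypothetical, and the sketch assumes each mapping in smaps has a line of the form "ProtectionKey: <n>".

```shell
#!/bin/sh
# Succeed (exit status 0) if any mapping in the given smaps-format file
# carries the given protection key; fail (exit status 1) otherwise.
pkey_in_use() {
    smaps_file=$1
    key=$2
    awk -v key="$key" '
        $1 == "ProtectionKey:" && $2 == key { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$smaps_file"
}

# Hypothetical usage: only free key 3 if nothing is still mapped with it.
#   if ! pkey_in_use "/proc/$$/smaps" 3; then
#       ... have the application call pkey_free(3) ...
#   fi
```

Such a check is not atomic with the later pkey_free() call, so a multithreaded application would presumably still need its own locking around the two steps.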

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 5/8] x86, pkeys: allocation/free syscalls

2016-06-02 Thread Michael Kerrisk (man-pages)
On 06/01/2016 07:17 PM, Dave Hansen wrote:
> On 06/01/2016 05:11 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>
>>>>>> If I read this right, it doesn't actually remove any pkey restrictions
>>>>>> that may have been applied while the key was allocated.  So there
>>>>>> could be pages with that key assigned that might do surprising
>>>>>> things if the key is reallocated for another use later, right?
>>>>>> Is that how the API is intended to work?
>>>>
>>>> Yeah, that's how it works.
>>>>
>>>> It's not ideal.  It would be _best_ if we during mm_pkey_free(), we
>>>> ensured that no VMAs under that mm have that vma_pkey() set.  But, that
>>>> search would be potentially expensive (a walk over all VMAs), or would
>>>> force us to keep a data structure with a count of all the VMAs with a
>>>> given key.
>>>>
>>>> I should probably discuss this behavior in the manpages and address it
>> s/probably//
>>
>> And, did I miss it. Was there an updated man-pages patch in the latest
>> series? I did not notice it.
> 
> There have been no changes to the patches that warranted updating the
> manpages until now.  I'll send the update immediately.

Do those updated pages include discussion of the point noted above?
I could not see it mentioned there.

Just by the way, the above behavior seems to offer possibilities
for users to shoot themselves in the foot, in a way that has security
implications. (Or do I misunderstand?)

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 5/8] x86, pkeys: allocation/free syscalls

2016-06-01 Thread Michael Kerrisk (man-pages)
Hi Dave,

On 1 June 2016 at 14:32, Dave Hansen <d...@sr71.net> wrote:
> On 06/01/2016 11:37 AM, Jonathan Corbet wrote:
>>> +static inline
>>> +int mm_pkey_free(struct mm_struct *mm, int pkey)
>>> +{
>>> +    /*
>>> +     * pkey 0 is special, always allocated and can never
>>> +     * be freed.
>>> +     */
>>> +    if (!pkey || !validate_pkey(pkey))
>>> +        return -EINVAL;
>>> +    if (!mm_pkey_is_allocated(mm, pkey))
>>> +        return -EINVAL;
>>> +
>>> +    mm_set_pkey_free(mm, pkey);
>>> +
>>> +    return 0;
>>> +}
>>
>> If I read this right, it doesn't actually remove any pkey restrictions
>> that may have been applied while the key was allocated.  So there could be
>> pages with that key assigned that might do surprising things if the key is
>> reallocated for another use later, right?  Is that how the API is intended
>> to work?
>
> Yeah, that's how it works.
>
> It's not ideal.  It would be _best_ if we during mm_pkey_free(), we
> ensured that no VMAs under that mm have that vma_pkey() set.  But, that
> search would be potentially expensive (a walk over all VMAs), or would
> force us to keep a data structure with a count of all the VMAs with a
> given key.
>
> I should probably discuss this behavior in the manpages and address it

s/probably//

And, did I miss it. Was there an updated man-pages patch in the latest
series? I did not notice it.

> more directly in the changelog for this patch.

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Mount namespace "dominant peer group"?

2016-05-24 Thread Michael Kerrisk (man-pages)
On 05/23/2016 02:55 AM, Miklos Szeredi wrote:
> C is slave of B is slave of A.  If a process can see (i.e. has under
> its root) A and C but not B then for C it will show
> master:B,propagate_from:A.  This piece of information is shown because
> it can't see the immediate master (B) and so cannot determine the
> chain of propagation between the mounts it can see.

Thanks, Miklos!

> Concrete example:

Yep, that does it. Thanks for the walk through!

One piece missing below though, in case anyone else tries to walk
through.

> # mount --bind / /mnt
> # mount --bind /proc /mnt/proc
> # mount --make-private /mnt
> # mount --make-shared /mnt
> # mkdir /tmp/etc
> # mount --bind /mnt/etc /tmp/etc
> # mount --make-slave /tmp/etc
> # mount --make-shared /tmp/etc

# mkdir /mnt/tmp/etc

> # mount --bind /tmp/etc /mnt/tmp/etc
> # mount --make-slave /mnt/tmp/etc
> # cat /proc/self/mountinfo | grep /tmp/etc
> 164 40 253:1 /etc /tmp/etc rw,relatime shared:100 master:97 - ...
> # chroot /mnt
> # cat /proc/self/mountinfo
> 129 62 253:1 / / rw,relatime shared:97 - ...
> 168 129 253:1 /etc /tmp/etc rw,relatime master:100 propagate_from:97 - ...
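
For anyone repeating the walk-through, the tags of interest (shared:X, master:X, propagate_from:X) are the optional fields of each mountinfo line: everything after the sixth field (the mount options) and before the lone "-" separator. A minimal sketch for pulling them out follows; the function name `optional_fields` is hypothetical.

```shell
#!/bin/sh
# Print the optional fields of one mountinfo line: everything between the
# sixth field (mount options) and the lone "-" separator.
optional_fields() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 7; i <= NF && $i != "-"; i++) {
            sep = (out == "") ? "" : " "
            out = out sep $i
        }
        print out
    }'
}

# The two /tmp/etc lines from the example above.
optional_fields "164 40 253:1 /etc /tmp/etc rw,relatime shared:100 master:97 - ..."
optional_fields "168 129 253:1 /etc /tmp/etc rw,relatime master:100 propagate_from:97 - ..."
```

Applied to the two lines above, this isolates the difference between the view outside the chroot (shared and master tags) and the view inside it (master and propagate_from tags).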

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Mount namespace "dominant peer group"?

2016-05-21 Thread Michael Kerrisk (man-pages)
Hello Ram,

On 05/20/2016 06:15 PM, Ram Pai wrote:
> On Fri, May 20, 2016 at 04:24:18PM -0500, Michael Kerrisk (man-pages) wrote:
>> Hello Miklos,
>>
>> I'm working on some better documentation of mount namespaces,
>> and there's a detail that puzzles me, and I hope you might be 
>> able to help, since you added the detail...
>>
>> In Documentation/filesystems/proc.txt there is this text in the
>> description of /proc/PID/mountinfo:
>>
>> [[
>> Parsers should ignore all unrecognised optional fields.  Currently the
>> possible optional fields are:
>>
>> shared:X  mount is shared in peer group X
>> master:X  mount is slave to peer group X
>> propagate_from:X  mount is slave and receives propagation from peer group X (*)
>> unbindable  mount is unbindable
>> 
>> (*) X is the closest dominant peer group under the process's root.  If
>> X is the immediate master of the mount, or if there's no dominant peer 
>> group under the same root, then only the "master:X" field is present
>> and not the "propagate_from:X" field.
>> ]]
>>
>> What is a dominant peer group, as distinct from the immediate master?
>>
>> I can see in fs/proc_namespace.c that there is this distinction made:
>>
>> [[
>>     /* Tagged fields ("foo:X" or "bar") */
>>     if (IS_MNT_SHARED(r))
>>         seq_printf(m, " shared:%i", r->mnt_group_id);
>>     if (IS_MNT_SLAVE(r)) {
>>         int master = r->mnt_master->mnt_group_id;
>>         int dom = get_dominating_id(r, &p->root);
>>         seq_printf(m, " master:%i", master);
>>         if (dom && dom != master)
>>             seq_printf(m, " propagate_from:%i", dom);
>>     }
>> ]]
>>
>> But I can't relate that to some user-space semantics. I suppose another
>> way of asking my question is: how could I create a slave that is
>> propagating from a peer group other than its immediate master?
> 
> It can happen if you have unmounted or privatised all your master mounts
> from the peer group.
> 
> Eg:
> 
> mount /dev/xyz /1         # creates a new mount
> mount --make-private /1   # just make sure that it does not receive or
>                           # send any propagation
> mount --make-shared /1    # now make it shared
> mount --bind /1 /2        # create a peer; /1 and /2 are peers
> create a new fs-namespace. This new fs-namespace will have /1' and /2'.
> /1 /2 /1' /2' are now all part of the same peer group.
> mount --make-slave /2     # this will make /2 a slave of the peer group
>                           # that contains /1, /1' and /2'
> umount /1                 # we now have /2, which receives propagation
>                           # from a peer group which does not have a
>                           # representative in its fs-namespace

Thanks for the note. However, doing the above, I still do not
see any mount being marked with 'propagate_from'. Perhaps I
misunderstood your instructions above. Here's what I did:

sh1# mount --make-private /  # Make sure everything is private...
sh1# mount /dev/sdb6 /1
sh1# mount --make-private /1
sh1# mount --make-shared /1
sh1# mount --bind /1 /2
sh1# cat /proc/self/mountinfo | grep '/[12] ' | sed 's/ - .*//'
81 61 8:22 / /1 rw,relatime shared:1
82 61 8:22 / /2 rw,relatime shared:1

Then, at a second terminal, create a new mount NS:

sh2# unshare -m --propagation unchanged sh
sh2# cat /proc/self/mountinfo | grep '/[12] ' | sed 's/ - .*//'
169 132 8:22 / /1 rw,relatime shared:1
170 132 8:22 / /2 rw,relatime shared:1

Returning to the first terminal:

sh1# mount --make-slave /2
sh1# umount /1
sh1# cat /proc/self/mountinfo | grep '/[12] ' | sed 's/ - .*//'
82 61 8:22 / /2 rw,relatime master:1

That is, we see /2 in the initial mount namespace is a slave
but there is no 'propagate_from' tag. Did I miss something?
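
One way to spot this situation in mountinfo-format text is to list slave mounts whose master peer group has no shared:X member in the same listing. The sketch below does that; the function name `slaves_of_invisible_master` is hypothetical, and it says nothing about whether the kernel will also print a propagate_from tag, which depends on a dominating peer group being visible under the process's root.

```shell
#!/bin/sh
# Read mountinfo-format text on stdin and print every mount point that is a
# slave ("master:X") of a peer group with no "shared:X" member in the same
# text, i.e. whose immediate master is not visible to the reader.
slaves_of_invisible_master() {
    awk '
        {
            # Optional fields start at field 7 and end at the "-" separator
            # (or at end of line if the separator was trimmed off).
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/) visible[substr($i, 8)] = 1
                if ($i ~ /^master:/) master[$5] = substr($i, 8)
            }
        }
        END {
            for (mp in master)
                if (!(master[mp] in visible))
                    print mp, "master:" master[mp]
        }
    '
}

# Hypothetical usage against the live namespace:
#   slaves_of_invisible_master < /proc/self/mountinfo
```

Fed the final listing from the first terminal above, it reports /2, since no mount tagged shared:1 remains in that namespace.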

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Mount namespace "dominant peer group"?

2016-05-20 Thread Michael Kerrisk (man-pages)
Hello Miklos,

I'm working on some better documentation of mount namespaces,
and there's a detail that puzzles me, and I hope you might be 
able to help, since you added the detail...

In Documentation/filesystems/proc.txt there is this text in the
description of /proc/PID/mountinfo:

[[
Parsers should ignore all unrecognised optional fields.  Currently the
possible optional fields are:

shared:X  mount is shared in peer group X
master:X  mount is slave to peer group X
propagate_from:X  mount is slave and receives propagation from peer group X (*)
unbindable  mount is unbindable

(*) X is the closest dominant peer group under the process's root.  If
X is the immediate master of the mount, or if there's no dominant peer 
group under the same root, then only the "master:X" field is present
and not the "propagate_from:X" field.
]]

What is a dominant peer group, as distinct from the immediate master?

I can see in fs/proc_namespace.c that there is this distinction made:

[[
    /* Tagged fields ("foo:X" or "bar") */
    if (IS_MNT_SHARED(r))
        seq_printf(m, " shared:%i", r->mnt_group_id);
    if (IS_MNT_SLAVE(r)) {
        int master = r->mnt_master->mnt_group_id;
        int dom = get_dominating_id(r, &p->root);
        seq_printf(m, " master:%i", master);
        if (dom && dom != master)
            seq_printf(m, " propagate_from:%i", dom);
    }
]]
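
Read purely as a decision rule, the snippet above can be modelled in shell as follows. This is only a sketch of which tags get printed: the function name `tagged_fields` and the convention that an ID of 0 means "not shared", "not a slave" or "no dominating group found" are assumptions made for illustration, and the sketch does not show how the kernel computes the dominating ID.

```shell
#!/bin/sh
# Model of the tag-selection logic in the kernel snippet.
# Arguments: shared peer group ID, master peer group ID, dominating peer
# group ID. An ID of 0 stands for "none" in this model.
tagged_fields() {
    shared=$1
    master=$2
    dom=$3
    out=""
    if [ "$shared" -ne 0 ]; then
        out="$out shared:$shared"
    fi
    if [ "$master" -ne 0 ]; then
        out="$out master:$master"
        # propagate_from appears only if a dominating group was found and
        # it differs from the immediate master.
        if [ "$dom" -ne 0 ] && [ "$dom" -ne "$master" ]; then
            out="$out propagate_from:$dom"
        fi
    fi
    printf '%s\n' "${out# }"
}

tagged_fields 0 1 1     # dominating group is the immediate master
tagged_fields 0 100 97  # dominating group differs from the immediate master
```

The first call prints only a master tag; the second adds propagate_from, which is the case the question is asking how to produce from user space.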

But I can't relate that to some user-space semantics. I suppose another
way of asking my question is: how could I create a slave that is
propagating from a peer group other than its immediate master?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


man-pages-4.06 is released

2016-05-11 Thread Michael Kerrisk (man-pages)
Gidday,

The Linux man-pages maintainer proudly announces:

man-pages-4.06 - man pages for Linux

This release includes input and contributions from
around 20 people. Around 40 pages saw changes, ranging
from typo fixes through to page rewrites and newly
created pages.

Tarball download:
http://www.kernel.org/doc/man-pages/download.html
Git repository:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
http://man7.org/linux/man-pages/changelog.html#release_4.06

A short summary of the release is blogged at:
http://linux-man-pages.blogspot.com/2016/05/man-pages-406-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers on LKML is shown below.

Cheers,

Michael

==================== Changes in man-pages-4.06 ====================

New and rewritten pages
-----------------------

cgroups.7
    Serge Hallyn, Michael Kerrisk
        New page documenting cgroups

cgroup_namespaces.7
    Michael Kerrisk  [Serge Hallyn]
        New page describing cgroup namespaces


Newly documented interfaces in existing pages
---------------------------------------------

clone.2
    Michael Kerrisk
        Document CLONE_NEWCGROUP

readv.2
    Christoph Hellwig
        Document preadv2() and pwritev2()

setns.2
    Michael Kerrisk
        Document CLONE_NEWCGROUP

unshare.2
    Michael Kerrisk
        Document CLONE_NEWCGROUP


Changes to individual pages
---------------------------

clock_getres.2
    Michael Kerrisk  [Rasmus Villemoes]
        Note that coarse clocks need architecture and VDSO support

execve.2
    Michael Kerrisk  [Valery Reznic]
        Since Linux 2.6.28, recursive script interpretation is supported

fcntl.2
    Michael Kerrisk
        Note that mandatory locking is now governed by a configuration option

mount.2
    Michael Kerrisk
        MS_MANDLOCK requires CAP_SYS_ADMIN (since Linux 4.5)

quotactl.2
    Michael Kerrisk
        Document Q_GETNEXTQUOTA and Q_XGETNEXTQUOTA

sigaction.2
    Michael Kerrisk
        Document SEGV_BNDERR
    Michael Kerrisk
        Document SEGV_PKUERR

core.5
    Michael Kerrisk
        Document /proc/sys/kernel/core_pipe_limit

namespaces.7
    Michael Kerrisk
        SEE ALSO: add cgroups(7), cgroup_namespaces(7)

vdso.7
    Zubair Lutfullah Kakakhel  [Mike Frysinger]
        Update for MIPS
            Document the symbols exported by the MIPS VDSO.
            VDSO support was added from kernel 4.4 onwards.

            See https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/arch/mips/vdso
    Michael Kerrisk  [Rasmus Villemoes]
        The __kernel_clock_* interfaces don't support *_COARSE clocks on PowerPC

ld.so.8
    Michael Kerrisk  [Alon Bar-Lev]
        Document use of $ORIGIN, $LIB, and $PLATFORM in environment variables
            These strings are meaningful in LD_LIBRARY_PATH and LD_PRELOAD.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] mountinfo: implement show_path for kernfs and cgroup

2016-05-06 Thread Michael Kerrisk (man-pages)
Hi Serge,

On 6 May 2016 at 19:33, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
>> Hi Serge,
>>
>> I'll add my own notes below, as much as anything in order to convince
>> myself that I understand what's going on.
>>
>> On 05/05/2016 05:20 PM, Serge E. Hallyn wrote:
>> > Short explanation:
>> >
>> > When showing a cgroupfs entry in mountinfo, show the path of the mount
>> > root dentry relative to the reader's cgroup namespace root.
>>
>> As part of the commit message, I think it would be useful to add a
>> sentence here explaining why this is needed / which applications need it.
>>
>> > Long version:
>> >
>> > When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
>> > namespace, and then mounts a new instance of the freezer cgroup, the new
>> > mount will be rooted at /a/b.  The root dentry field of the mountinfo
>> > entry will show '/a/b'.
>>
>> So, the point is that if we create a new cgroup namespace,
>> then we want both /proc/self/cgroup and /proc/self/mountinfo
>> to show cgroup paths that are correctly virtualized with
>> respect to the cgroup mount point. Previous to this patch,
>> /proc/self/cgroup shows the right info, but
>> /proc/self/mountinfo does not. (Walk through in a moment.)
>>
>> Is the above a correct summary?

Feel free to add that piece to the commit message :-).

[...]

>> So, I applied your patch against a current (i.e., 4.6-rc6) kernel.
>> Same steps as before, and here's what I see:
>>
>> # mkdir -p /sys/fs/cgroup/freezer/a/b
>> # echo $$ > /sys/fs/cgroup/freezer/a/b/cgroup.procs
>> # ./cgroup_info.sh
>>   /proc/self/cgroup:  10:freezer:/a/b
>>   mountinfo:  /   /sys/fs/cgroup/freezer
>> # ~mtk/tlpi/code/ns/unshare -Cm bash
>> # ./cgroup_info.sh
>>   /proc/self/cgroup:  10:freezer:/
>>   mountinfo:  /../..  /sys/fs/cgroup/freezer
>> # mount --make-rslave /
>> # mkdir -p /mnt/freezer
>> # umount /sys/fs/cgroup/freezer
>> # mount -t cgroup -o freezer freezer /mnt/freezer/
>> # ./cgroup_info.sh
>>   /proc/self/cgroup:  10:freezer:/
>>   mountinfo:  /   /mnt/freezer
>>
>> Now the root directory path shown by mountinfo is correct,
>> and when we look inside the mount point, we see that things
>> look "right" (i.e., a cgroup root directory with no
>> subdirectories, and the PID of the shell run by unshare is
>> in the cgroup.procs file of this cgroup):
>>
>> # ls /mnt/freezer/
>> cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
>> cgroup.procs           freezer.self_freezing    notify_on_release
>> # echo $$
>> 3164
>> # cat /mnt/freezer/cgroup.procs
>> 2653   # First shell that was placed in this cgroup
>> 3164   # Shell started by 'unshare'
>> 14197  # cat(1)
>>
>> All makes sense to me.
>
> Right.  So in particular, docker wants to do something like:
>
> bindpath=`grep freezer /proc/self/mountinfo | tail -n 1 | awk '{ print $4 }'`
> mountpoint=`grep freezer /proc/self/mountinfo | tail -n 1 | awk '{ print $5 }'`
> mycg=`awk -F: '/freezer/ { print $3 }' /proc/self/cgroup`
> cat ${mountpoint}/${bindpath}/${mycg}/cgroup.procs
>
> and see its own task.

I think that'd be a great piece to include in the commit message, near
the top, as rationale for the patch
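
Serge's snippet can be wrapped into a function that takes the mountinfo and cgroup files as parameters, so it can be tried against saved copies as well as the live /proc files. This is a sketch only: the function name `cgroup_procs_path` and its parameters are hypothetical, and the `tr -s /` step, which collapses the repeated slashes produced when "/" components are joined, is an addition that is not in the original snippet.

```shell
#!/bin/sh
# Print the path of the cgroup.procs file for the caller's cgroup in the
# given controller.
#   $1  controller name, e.g. freezer
#   $2  mountinfo-format file (default: /proc/self/mountinfo)
#   $3  cgroup-format file    (default: /proc/self/cgroup)
cgroup_procs_path() {
    controller=$1
    mountinfo=${2:-/proc/self/mountinfo}
    cgroup=${3:-/proc/self/cgroup}
    # Fields 4 and 5 of the last matching mountinfo line are the root of the
    # mount within the filesystem and the mount point.
    bindpath=$(grep "$controller" "$mountinfo" | tail -n 1 | awk '{ print $4 }')
    mountpoint=$(grep "$controller" "$mountinfo" | tail -n 1 | awk '{ print $5 }')
    # Third colon-separated field of the matching cgroup line.
    mycg=$(awk -F: -v c="$controller" '$2 ~ c { print $3 }' "$cgroup")
    printf '%s\n' "$mountpoint/$bindpath/$mycg/cgroup.procs" | tr -s /
}

# Hypothetical usage:
#   cat "$(cgroup_procs_path freezer)"
```

The result only names the caller's own cgroup.procs file if the root path shown in mountinfo is expressed relative to the same cgroup namespace root as the path in /proc/self/cgroup, which is the consistency the patch is meant to provide.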

>> Tested-by: Michael Kerrisk <mtk.manpa...@gmail.com>
>> Acked-by: Michael Kerrisk <mtk.manpa...@gmail.com>
>>
>> (I did no review of the patch itself though.)
>
> Thanks, Michael.

You're welcome.

> I'll resend with corrections and a test script of
> some sort.

I think including some version of the two walk-throughs (without + with
patch) would also make for a great commit message :-).

Cheers,

Michael

[...]


Re: [PATCH] mountinfo: implement show_path for kernfs and cgroup

2016-05-06 Thread Michael Kerrisk (man-pages)
e information in
/proc/PID/mountinfo. (The current patch fixes exactly this problem.)

> With this patch, the dentry root field in mountinfo is shown relative
> to the reader's cgroup namespace.  I.e.:
> 
>  unshare -Gm  bash /tmp/do1
>  > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
>  > 355 133 0:34 / /mnt rw,relatime - cgroup freezer rw,freezer
> 
> This way the task can correlate the paths in /proc/pid/cgroup to
> /proc/self/mountinfo, and determine which cgroup directory (in any
> mount which the reader created) corresponds to the task.

So, I applied your patch against a current (i.e., 4.6-rc6) kernel.
Same steps as before, and here's what I see:

# mkdir -p /sys/fs/cgroup/freezer/a/b
# echo $$ > /sys/fs/cgroup/freezer/a/b/cgroup.procs
# ./cgroup_info.sh 
/proc/self/cgroup:  10:freezer:/a/b
mountinfo:  /   /sys/fs/cgroup/freezer
# ~mtk/tlpi/code/ns/unshare -Cm bash
# ./cgroup_info.sh 
/proc/self/cgroup:  10:freezer:/
mountinfo:  /../..  /sys/fs/cgroup/freezer
# mount --make-rslave /
# mkdir -p /mnt/freezer
# umount /sys/fs/cgroup/freezer
# mount -t cgroup -o freezer freezer /mnt/freezer/
# ./cgroup_info.sh
/proc/self/cgroup:  10:freezer:/
mountinfo:  /   /mnt/freezer

Now the root directory path shown by mountinfo is correct,
and when we look inside the mount point, we see that things
look "right" (i.e., a cgroup root directory with no
subdirectories, and the PID of the shell run by unshare is
in the cgroup.procs file of this cgroup):

# ls /mnt/freezer/
cgroup.clone_children  freezer.parent_freezing  freezer.state  tasks
cgroup.procs   freezer.self_freezing    notify_on_release
# echo $$
3164
# cat /mnt/freezer/cgroup.procs
2653   # First shell that was placed in this cgroup
3164   # Shell started by 'unshare'
14197  # cat(1)

All makes sense to me.
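
The difference between the two mountinfo lines in the walk-through can
also be checked mechanically. A minimal sketch, assuming (as the
output above suggests) that a ".." component in the root field means
the mount is rooted above the reader's cgroup namespace root; the
helper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: classify the root field (field 4 of a mountinfo
# line) of a cgroup mount as seen from the reader's cgroup namespace.
cgroup_root_scope() {
    case "/$1/" in
        */../*) echo "outside-namespace" ;;   # e.g. "/../.." before the remount
        *)      echo "inside-namespace" ;;    # e.g. "/" after the remount
    esac
}
```

In the session above, the inherited mount at /sys/fs/cgroup/freezer
would be reported as outside the namespace and the fresh mount at
/mnt/freezer as inside it.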

Tested-by: Michael Kerrisk <mtk.manpa...@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpa...@gmail.com>

(I did no review of the patch itself though.)

Cheers,

Michael


> Signed-off-by: Serge Hallyn <serge.hal...@ubuntu.com>
> ---
>  fs/kernfs/mount.c      | 14 +++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 63 ++
>  3 files changed, 79 insertions(+)
> 
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f73541f..3b78724 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "kernfs-internal.h"
>  
> @@ -40,6 +41,18 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
>  	return 0;
>  }
>  
> +static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
> +{
> +	struct kernfs_node *node = dentry->d_fsdata;
> +	struct kernfs_root *root = kernfs_root(node);
> +	struct kernfs_syscall_ops *scops = root->syscall_ops;
> +
> +	if (scops && scops->show_path)
> +		return scops->show_path(sf, node, root);
> +
> +	return seq_dentry(sf, dentry, " \t\n\\");
> +}
> +
>  const struct super_operations kernfs_sops = {
>  	.statfs		= simple_statfs,
>  	.drop_inode	= generic_delete_inode,
> @@ -47,6 +60,7 @@ const struct super_operations kernfs_sops = {
>  
>  	.remount_fs	= kernfs_sop_remount_fs,
>  	.show_options	= kernfs_sop_show_options,
> +	.show_path	= kernfs_sop_show_path,
>  };
>  
>  /**
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index c06c442..30f089e 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -152,6 +152,8 @@ struct kernfs_syscall_ops {
>  	int (*rmdir)(struct kernfs_node *kn);
>  	int (*rename)(struct kernfs_node *kn, struct kernfs_node *new_parent,
>  		      const char *new_name);
> +	int (*show_path)(struct seq_file *sf, struct kernfs_node *kn,
> +			 struct kernfs_root *root);
>  };
>  
>  struct kernfs_root {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 909a7d3..afea39e 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1215,6 +1215,41 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>  	cgroup_free_root(root);
>  }
>  
> +/*
> + * look up cgroup associated with current task's cgroup namespace on the
> + * specified hierarchy
> + */
> +static struct cgroup *
> +current_cgns_cgroup_from_root(struct cgroup_root *root)
> +{
> +	struct cgroup *res = NULL;
> +	struct css_set *cset;
> +
> +	lockdep_assert_held(&css_set_lock);
> +
> +	rcu_read_lock();

Re: [PATCH 1/1] simplified security.nscapability xattr

2016-05-02 Thread Michael Kerrisk (man-pages)
On 05/02/2016 05:54 AM, Serge E. Hallyn wrote:
> On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
>> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
>>> Quoting Kees Cook (keesc...@chromium.org):
>>>> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hal...@ubuntu.com> wrote:
>>>>> From: Serge Hallyn <serge.hal...@ubuntu.com>
> ...
>>>> This looks like userspace must knowingly be aware that it is in a
>>>> namespace and to DTRT instead of it being translated by the kernel
>>>> when setxattr is called under !init_user_ns?
>>>
>>> Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If that
>>> shows you are in init_user_ns then it uses security.capability, otherwise
>>> it uses security.nscapability.
>>>
>>> I've occasionally considered having the xattr code do the quiet
>>> substitution if need be.
>>>
>>> In fact, much of this structure comes from when I was still trying to
>>> do multiple values per xattr.  Given what we're doing here, we could
>>> keep the xattr contents exactly the same, just changing the name.
>>> So userspace could just get and set security.capability;  if you are
>>> in a non-init user_ns, if security.capability is set then you cannot
>>> set it;  if security.capability is not set, then the kernel writes
>>> security.nscapability instead and returns success.
>>>
>>> I don't like magic, but this might be just straightforward enough
>>> to not be offensive.  Thoughts?
>>
>> Yeah, I think it might be better to have the magic in this case, since
>> it seems weird to just reject setxattr if a tool didn't realize it was
>> in a namespace. I'm not sure -- it is also nice to have an explicit
>> API here.
>>
>> I would defer to Eric or Michael on that. I keep going back and forth,
>> though I suspect it's probably best to do what you already have
>> (explicit API).
> 
> Michael, Eric, what do you think?  The choice we're making here is
> whether we should
> 
> 1. Keep a nice simple separate pair of xattrs, the pre-existing
> security.capability which can only be written from init_user_ns,
> and the new (in this patch) security.nscapability which you can
> write to any file where you are privileged wrt the file.
> 
> 2. Make security.capability somewhat 'magic' - if someone in a
> non-initial user ns tries to write it and has privilege wrt the
> file, then the kernel silently writes security.nscapability instead.
> 
> The biggest drawback of (1) would be any tar-like program trying
> to restore a file which had security.capability, needing to know
> to detect its userns and write the security.nscapability instead.
> The drawback of (2) is ~\o/~ magic.

I have only (minor) thoughts from the interface perspective.
(1) Sounds like a source of possibly unpleasant surprises.
(2) Is a little surprising, but less so if it's well documented,
and it saves us the surprises of (1). So, (2) sounds better.
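
To make the trade-off concrete, here is a rough sketch of both
options. The first function is the kind of check a tar-like tool
would need under (1); it assumes that a single identity map line
("0 0 4294967295") in uid_map identifies the initial user namespace,
which is a heuristic and not necessarily what the libcap2 patch does.
The second restates the rule Serge describes for (2). Both function
names are invented for illustration.

```shell
#!/bin/sh
# Option (1), user-space side: choose the xattr name from a uid_map
# file ($1, normally /proc/self/uid_map). Assumption: exactly one line
# mapping 0 to 0 for the full 32-bit range means "initial user ns".
xattr_name_for_uid_map() {
    awk '
        { lines++; if ($1 == 0 && $2 == 0 && $3 == 4294967295) full = 1 }
        END {
            if (lines == 1 && full)
                print "security.capability"
            else
                print "security.nscapability"
        }
    ' "$1"
}

# Option (2), the "magic" rule as described in the quoted text: what a
# write to security.capability would end up doing.
#   $1 = yes/no: is the writer in the initial user namespace?
#   $2 = yes/no: is security.capability already set on the file?
magic_setxattr_target() {
    if [ "$1" = yes ]; then
        echo "security.capability"
    elif [ "$2" = yes ]; then
        echo "denied"                    # cannot override the existing value
    else
        echo "security.nscapability"     # written silently, reports success
    fi
}
```

Under (2) every tool would keep writing security.capability, and only
the kernel would need the logic of the second function.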

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] Implement leftpad syscall

2016-03-31 Thread Michael Kerrisk (man-pages)
On 04/01/2016 11:33 AM, Richard Weinberger wrote:
> From: David Gstir <da...@sigma-star.at>
> 
> Implement the leftpad() system call such that userspace,
> especially node.js applications, can in the near future directly
> use it and no longer depend on fragile npm packages.

Words can't express the importance of adding this system call!
Thanks so much for proposing and implementing it!

Acked-by: Michael Kerrisk <mtk.manpa...@gmail.com>

Cheers,

Michael

> Signed-off-by: David Gstir <da...@sigma-star.at>
> Signed-off-by: Richard Weinberger <rich...@nod.at>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/syscalls.h               |  1 +
>  kernel/sys.c                           | 35 ++
>  kernel/sys_ni.c                        |  1 +
>  4 files changed, 38 insertions(+)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index cac6d17..f287712 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -335,6 +335,7 @@
>  326  common  copy_file_range sys_copy_file_range
>  327  64  preadv2 sys_preadv2
>  328  64  pwritev2sys_pwritev2
> +329  common  leftpad sys_leftpad
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index d795472..a0850bb 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -898,4 +898,5 @@ asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
>  
>  asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
>  
> +asmlinkage long sys_leftpad(char *str, char pad, char *dst, size_t dst_len);
>  #endif
> diff --git a/kernel/sys.c b/kernel/sys.c
> index cf8ba54..e42d972 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2432,3 +2432,38 @@ COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)
>  	return 0;
>  }
>  #endif /* CONFIG_COMPAT */
> +
> +
> +SYSCALL_DEFINE4(leftpad, char *, src, char, pad, char *, dst, size_t, dst_len)
> +{
> +	char *buf;
> +	long ret;
> +	size_t len = strlen_user(src);
> +	size_t pad_len = dst_len - len;
> +
> +	if (dst_len <= len || dst_len > 4096) {
> +		return -EINVAL;
> +	}
> +
> +	buf = kmalloc(dst_len, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	memset(buf, pad, pad_len);
> +	ret = copy_from_user(buf + pad_len, src, len);
> +	if (ret) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = copy_to_user(dst, buf, dst_len);
> +	if (ret) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = pad_len;
> +out:
> +	kfree(buf);
> +	return ret;
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 2c5e3a8..262608d 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -175,6 +175,7 @@ cond_syscall(sys_setfsgid);
>  cond_syscall(sys_capget);
>  cond_syscall(sys_capset);
>  cond_syscall(sys_copy_file_range);
> +cond_syscall(sys_leftpad);
>  
>  /* arch-specific weak syscall entries */
>  cond_syscall(sys_pciconfig_read);
> 
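
For comparison, the same padding can be sketched in user space. This
is an illustration only and not part of the quoted patch: the function
name and argument order are chosen here, the width counts visible
characters rather than a buffer size, and the 4096 limit from the
patch is applied to that width.

```shell
#!/bin/sh
# User-space sketch of left-padding: pad string $1 on the left with the
# single character $2 until it is $3 characters wide.
# The body runs in a subshell so its variables stay local.
leftpad() (
    str=$1 pad=$2 width=$3
    # Require exactly one pad character (also avoids an endless loop).
    [ "${#pad}" -eq 1 ] || exit 1
    # Mirror the patch's -EINVAL cases: no room to pad, or too wide.
    if [ "$width" -le "${#str}" ] || [ "$width" -gt 4096 ]; then
        exit 1
    fi
    out=$str
    while [ "${#out}" -lt "$width" ]; do
        out=$pad$out
    done
    printf '%s\n' "$out"
)
```

For example, `leftpad 7 0 3` prints 007, while a width that is not
larger than the string length makes the function fail, as the syscall
in the patch would.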


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


man-pages-4.05 is released

2016-03-15 Thread Michael Kerrisk (man-pages)
Gidday,

The Linux man-pages maintainer proudly announces:

man-pages-4.05 - man pages for Linux

This release includes input and contributions from
nearly 70 people. Over 400 pages saw changes, ranging
from typo fixes through to page rewrites and newly
created pages.

Tarball download:
http://www.kernel.org/doc/man-pages/download.html
Git repository:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
http://man7.org/linux/man-pages/changelog.html#release_4.05

A short summary of the release is blogged at:
http://linux-man-pages.blogspot.com/2016/03/man-pages-405-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers on LKML is shown below.

Cheers,

Michael


 Changes in man-pages-4.05 


New and rewritten pages
-----------------------

copy_file_range.2
    Anna Schumaker  [Darrick J. Wong, Christoph Hellwig, Michael Kerrisk]
        New page documenting copy_file_range()

personality.2
    Michael Kerrisk
        This page has been greatly expanded, to add descriptions of
        personality domains.

fmemopen.3
    Michael Kerrisk  [Adhemerval Zanella]
        Significant reworking of this page:
        * Rework discussion of the (obsolete) binary mode
        * Split open_memstream(3) description into a separate page.
        * Note various fmemopen() bugs that were fixed in glibc 2.22
        * Greatly expand description of 'mode' argument
        * Rework description of 'buf' and 'len' arguments
        * Expand discussion of "current position" for fmemopen() stream

ntp_gettime.3
    Michael Kerrisk
        New page describing ntp_gettime(3) and ntp_gettimex(3)

open_memstream.3
    Michael Kerrisk
        New page created by split of fmemopen(3).
        At the same time, add and rework a few details in the text.

posix_spawn.3
    Bill O. Gallmeister, Michael Kerrisk
        New man page documenting posix_spawn(3) and posix_spawnp(3)

readdir.3
    Michael Kerrisk  [Florian Weimer]
        Split readdir_r() content into separate page
    Michael Kerrisk
        Near complete restructuring of the page and add some further details
    Michael Kerrisk  [Florian Weimer, Rich Felker, Paul Eggert]
        Add a lot more detail on portable use of the 'd_name' field

readdir_r.3
    Michael Kerrisk  [Florian Weimer]
        New page created after split of readdir(3).
    Michael Kerrisk  [Florian Weimer]
        Explain why readdir_r() is deprecated and readdir() is preferred

lirc.4
    Alec Leamas
        New page documenting lirc device driver


Newly documented interfaces in existing pages
---------------------------------------------

epoll_ctl.2
    Michael Kerrisk  [Jason Baron]
        Document EPOLLEXCLUSIVE

madvise.2
    Minchan Kim  [Michael Kerrisk]
        Document MADV_FREE
            Document the MADV_FREE flag added to madvise() in Linux 4.5.

proc.5
    Michael Kerrisk
        Document CmaTotal and CmaFree fields of /proc/meminfo
    Michael Kerrisk
        Document additional /proc/meminfo fields
            Document DirectMap4k, DirectMap4M, DirectMap2M, DirectMap1G
    Michael Kerrisk
        Document MemAvailable /proc/meminfo field
    Michael Kerrisk
        Document inotify /proc/PID/fdinfo entries
    Michael Kerrisk
        Document fanotify /proc/PID/fdinfo entries
    Michael Kerrisk
        Add some kernel version numbers for /proc/PID/fdinfo entries
    Michael Kerrisk  [Patrick Donnelly]
        /proc/PID/fdinfo displays the setting of the close-on-exec flag
            Note also the pre-3.1 bug in the display of this info.

socket.7
    Craig Gallek  [Michael Kerrisk, Vincent Bernat]
        Document some BPF-related socket options
            Document the behavior and the first kernel version for each of the
            following socket options:

                SO_ATTACH_FILTER
                SO_ATTACH_BPF
                SO_ATTACH_REUSEPORT_CBPF
                SO_ATTACH_REUSEPORT_EBPF
                SO_DETACH_FILTER
                SO_DETACH_BPF
                SO_LOCK_FILTER


Global changes
--------------

Many, many pages
    Michael Kerrisk
        Update, simplify and correct feature test macro requirements


Changes to individual pages
---------------------------

adjtimex.2
    Michael Kerrisk  [John Stultz]
        Various improvements after feedback from John Stultz


syscall.2
    Mike Frysinger
        Add more architectures and improve error documentation
            Move the error register documentation into the main table rather
            than listing them in sentences after the fact.

            Add sparc error return details.

            Add details for alpha/arc/m68k/microblaze/nios2/powerpc/superh/
            tile/xtensa.

feature_test_macros.7
    Michael Kerrisk
        Add a summary of some FTM key points
    Michael Kerrisk


man-pages-4.05 is released

2016-03-15 Thread Michael Kerrisk (man-pages)
Gidday,

The Linux man-pages maintainer proudly announces:

man-pages-4.05 - man pages for Linux

This release includes input and contributions from
nearly 70 people. Over 400 pages saw changes, ranging
from typo fixes through to page rewrites and newly
created pages.

Tarball download:
http://www.kernel.org/doc/man-pages/download.html
Git repository:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
http://man7.org/linux/man-pages/changelog.html#release_4.05

A short summary of the release is blogged at:
http://linux-man-pages.blogspot.com/2016/03/man-pages-405-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
yo readers on LKML is shown below.

Cheers,

Michael


 Changes in man-pages-4.05 


New and rewritten pages
---

copy_file_range.2
Anna Schumaker  [Darrick J. Wong, Christoph Hellwig, Michael Kerrisk]
New page documenting copy_file_range()

personality.2
Michael Kerrisk
This page has been greatly expanded, to add descriptions of
personality domains.

fmemopen.3
Michael Kerrisk  [Adhemerval Zanella]
Significant reworking of this page:
* Rework discussion of the (obsolete) binary mode
* Split open_memstream(3) description into a separate page.
* Note various fmemopen() bugs that were fixed in glibc 2.22
* Greatly expand description of 'mode' argument
* Rework description of 'buf' and 'len' arguments
* Expand discussion of "current position" for fmemopen() stream

ntp_gettime.3
    Michael Kerrisk
New page describing ntp_gettime(3) and ntp_gettimex(3)

open_memstream.3
    Michael Kerrisk
New page created by split of fmemopen(3).
At the same time, add and rework a few details in the text.

posix_spawn.3
Bill O. Gallmeister, Michael Kerrisk
New man page documenting posix_spawn(3) and posix_spawnp(3)

readdir.3
    Michael Kerrisk  [Florian Weimer]
Split readdir_r() content into separate page
    Michael Kerrisk
Near complete restructuring of the page and add some further details
    Michael Kerrisk  [Florian Weimer, Rich Felker, Paul Eggert]
Add a lot more detail on portable use of the 'd_name' field

readdir_r.3
    Michael Kerrisk  [Florian Weimer]
New page created after split of readdir(3).
    Michael Kerrisk  [Florian Weimer]
Explain why readdir_r() is deprecated and readdir() is preferred

lirc.4
Alec Leamas
New page documenting lirc device driver


Newly documented interfaces in existing pages
---------------------------------------------

epoll_ctl.2
    Michael Kerrisk  [Jason Baron]
        Document EPOLLEXCLUSIVE

madvise.2
    Minchan Kim  [Michael Kerrisk]
        Document MADV_FREE
            Document the MADV_FREE flag added to madvise() in Linux 4.5.

proc.5
    Michael Kerrisk
        Document CmaTotal and CmaFree fields of /proc/meminfo
    Michael Kerrisk
        Document additional /proc/meminfo fields
            Document DirectMap4k, DirectMap4M, DirectMap2M, DirectMap1G
    Michael Kerrisk
        Document MemAvailable /proc/meminfo field
    Michael Kerrisk
        Document inotify /proc/PID/fdinfo entries
    Michael Kerrisk
        Document fanotify /proc/PID/fdinfo entries
    Michael Kerrisk
        Add some kernel version numbers for /proc/PID/fdinfo entries
    Michael Kerrisk  [Patrick Donnelly]
        /proc/PID/fdinfo displays the setting of the close-on-exec flag
            Note also the pre-3.1 bug in the display of this info.

socket.7
    Craig Gallek  [Michael Kerrisk, Vincent Bernat]
        Document some BPF-related socket options
            Document the behavior and the first kernel version for each of the
            following socket options:

                SO_ATTACH_FILTER
                SO_ATTACH_BPF
                SO_ATTACH_REUSEPORT_CBPF
                SO_ATTACH_REUSEPORT_EBPF
                SO_DETACH_FILTER
                SO_DETACH_BPF
                SO_LOCK_FILTER


Global changes
--------------

Many, many pages
    Michael Kerrisk
        Update, simplify and correct feature test macro requirements


Changes to individual pages
---------------------------

adjtimex.2
    Michael Kerrisk  [John Stultz]
        Various improvements after feedback from John Stultz


syscall.2
    Mike Frysinger
        Add more architectures and improve error documentation
            Move the error register documentation into the main table rather
            than listing them in sentences after the fact.

            Add sparc error return details.

            Add details for alpha/arc/m68k/microblaze/nios2/powerpc/superh/
            tile/xtensa.

feature_test_macros.7
    Michael Kerrisk
        Add a summary of some FTM key points
    Michael Kerrisk
  

Re: [PATCH] epoll: add exclusive wakeups flag

2016-03-14 Thread Michael Kerrisk (man-pages)
Hi Jason,

On 03/15/2016 11:35 AM, Jason Baron wrote:
> Hi Michael,
> 
> On 03/14/2016 05:03 PM, Michael Kerrisk (man-pages) wrote:
>> Hi Jason,
>>
>> On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote:
>>> Hi Jason,
>>>
>>> On 03/15/2016 08:32 AM, Jason Baron wrote:
>>>>
>>>>
>>>> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
>>>>> [Restoring CC, which I see I accidentally dropped, one iteration back.]
>>
>> [...]
>>
>>>>> Returning to the second sentence in this description:
>>>>>
>>>>>   When a wakeup event occurs and multiple epoll file descrip‐
>>>>>   tors are attached to the same target file using EPOLLEXCLU‐
>>>>>   SIVE, one or  more  of  the  epoll  file  descriptors  will
>>>>>   receive  an  event with epoll_wait(2).
>>>>>
>>>>> There is a point that is unclear to me: what does "target file" refer to?
>>>>> Is it an open file description (aka open file table entry) or an inode?
>>>>> I suspect the former, but it was not clear in your original text.
>>>>>
>>>>
>>>> So from epoll's perspective, the wakeups are associated with a 'wait
>>>> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
>>>> file->poll()) results in adding to the same 'wait queue' then we will
>>>> get 'exclusive' wakeup behavior.
>>>>
>>>> So in general, I think the answer here is that it's associated with the
>>>> inode (I couldn't say with 100% certainty without really looking at all
>>>> file->poll() implementations). Certainly, with the 'FIFO' example below,
>>>> the two scenarios will have the same behavior with respect to
>>>> EPOLLEXCLUSIVE.
>>
>> So, I was actually a little surprised by this, and went away and tested
>> this point. It appears to me that the two scenarios described below
>> do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See below.
>>
>>> So, in both scenarios, *one or more* processes will get a wakeup?
>>> (I'll try to add something to the text to clarify the detail we're 
>>> discussing.)
>>>
>>>> Also, the 'non-exclusive' mode would be subject to the same question of
>>>> which wait queue the epfd is associated with...
>>>
>>> I'm not sure of the point you are trying to make here?
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>>
>>>>> To make this point even clearer, here are two scenarios I'm thinking of.
>>>>> In each case, we're talking of monitoring the read end of a FIFO.
>>>>>
>>>>> ===
>>>>>
>>>>> Scenario 1:
>>>>>
>>>>> We have three processes each of which
>>>>> 1. Creates an epoll instance
>>>>> 2. Opens the read end of the FIFO
>>>>> 3. Adds the read end of the FIFO to the epoll instance, specifying
>>>>>    EPOLLEXCLUSIVE
>>>>>
>>>>> When input becomes available on the FIFO, how many processes
>>>>> get a wakeup?
>>
>> When I test this scenario, all three processes get a wakeup.
>>
>>>>> ===
>>>>>
>>>>> Scenario 2:
>>>>>
>>>>> A parent process opens the read end of a FIFO and then calls
>>>>> fork() three times to create three children. Each child then:
>>>>>
>>>>> 1. Creates an epoll instance
>>>>> 2. Adds the read end of the FIFO to the epoll instance, specifying
>>>>> EPOLLEXCLUSIVE
>>>>>
>>>>> When input becomes available on the FIFO, how many processes
>>>>> get a wakeup?
>>
>> When I test this scenario, one process gets a wakeup.
>>
>> In other words, "target file" appears to mean open file description
>> (aka open file table entry), not inode.
>>
>> This is actually what I suspected might be the case, but now I am
>> puzzled. Given what I've discovered and what you suggest are the
>> semantics, is the implementation correct? (I suspect that it is,
>> but it is at odds with your statement above.) My test programs are
>> inline below.
>>
>> Cheers,
>>
>> Michael
>>
> 
> Thanks for the test cases. So in your first test case, you are exiting
> immediatel

Re: [PATCH] epoll: add exclusive wakeups flag

2016-03-14 Thread Michael Kerrisk (man-pages)
Hi Jason,

On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote:
> Hi Jason,
> 
> On 03/15/2016 08:32 AM, Jason Baron wrote:
>>
>>
>> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
>>> [Restoring CC, which I see I accidentally dropped, one iteration back.]

[...]

>>> Returning to the second sentence in this description:
>>>
>>>   When a wakeup event occurs and multiple epoll file descrip‐
>>>   tors are attached to the same target file using EPOLLEXCLU‐
>>>   SIVE, one or  more  of  the  epoll  file  descriptors  will
>>>   receive  an  event with epoll_wait(2).
>>>
>>> There is a point that is unclear to me: what does "target file" refer to?
>>> Is it an open file description (aka open file table entry) or an inode?
>>> I suspect the former, but it was not clear in your original text.
>>>
>>
>> So from epoll's perspective, the wakeups are associated with a 'wait
>> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
>> file->poll()) results in adding to the same 'wait queue' then we will
>> get 'exclusive' wakeup behavior.
>>
>> So in general, I think the answer here is that it's associated with the
>> inode (I couldn't say with 100% certainty without really looking at all
>> file->poll() implementations). Certainly, with the 'FIFO' example below,
>> the two scenarios will have the same behavior with respect to
>> EPOLLEXCLUSIVE.

So, I was actually a little surprised by this, and went away and tested
this point. It appears to me that the two scenarios described below
do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See below.

> So, in both scenarios, *one or more* processes will get a wakeup?
> (I'll try to add something to the text to clarify the detail we're 
> discussing.)
> 
>> Also, the 'non-exclusive' mode would be subject to the same question of
>> which wait queue the epfd is associated with...
> 
> I'm not sure of the point you are trying to make here?
> 
> Cheers,
> 
> Michael
> 
> 
>>> To make this point even clearer, here are two scenarios I'm thinking of.
>>> In each case, we're talking of monitoring the read end of a FIFO.
>>>
>>> ===
>>>
>>> Scenario 1:
>>>
>>> We have three processes each of which
>>> 1. Creates an epoll instance
>>> 2. Opens the read end of the FIFO
>>> 3. Adds the read end of the FIFO to the epoll instance, specifying
>>>    EPOLLEXCLUSIVE
>>>
>>> When input becomes available on the FIFO, how many processes
>>> get a wakeup?

When I test this scenario, all three processes get a wakeup.

>>> ===
>>>
>>> Scenario 2:
>>>
>>> A parent process opens the read end of a FIFO and then calls
>>> fork() three times to create three children. Each child then:
>>>
>>> 1. Creates an epoll instance
>>> 2. Adds the read end of the FIFO to the epoll instance, specifying
>>> EPOLLEXCLUSIVE
>>>
>>> When input becomes available on the FIFO, how many processes
>>> get a wakeup?

When I test this scenario, one process gets a wakeup.
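
For readers who want to try the fork-based scenario themselves, here is
a minimal sketch, not one of the test programs from this thread. It
assumes an anonymous pipe as a stand-in for the FIFO (so that nothing
needs to exist in the filesystem), and the helper name
count_exclusive_wakeups() is made up for illustration. The 100 ms pause
before the write is only a heuristic to let the children block in
epoll_wait() first; the count returned depends on the kernel and on
timing, so no particular result is claimed here.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

/* Hypothetical helper: fork 'nchildren' children that all inherit the
   read end of one pipe (and therefore share one open file description).
   Each child adds that descriptor to its own epoll instance with
   EPOLLEXCLUSIVE and waits up to 'timeout_ms'. Returns the number of
   children whose epoll_wait() reported an event, or -1 on error. */
int
count_exclusive_wakeups(int nchildren, int timeout_ms)
{
    int data[2], ready[2], woken = 0, failed = 0;

    if (pipe(data) == -1 || pipe(ready) == -1)
        return -1;

    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0) {                     /* Child */
            struct epoll_event ev = { 0 }, rev;
            int epfd, n;

            close(ready[0]);
            epfd = epoll_create1(0);
            if (epfd == -1)
                _exit(2);
            ev.events = EPOLLIN | EPOLLEXCLUSIVE;
            ev.data.fd = data[0];
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, data[0], &ev) == -1)
                _exit(2);
            if (write(ready[1], "r", 1) != 1)   /* Tell parent: set up */
                _exit(2);
            n = epoll_wait(epfd, &rev, 1, timeout_ms);
            /* Exit status: 0 = got an event, 1 = timed out, 2 = error */
            _exit(n > 0 ? 0 : (n == 0 ? 1 : 2));
        }
    }
    close(ready[1]);

    /* Wait until every child has registered with its epoll instance */
    for (int i = 0; i < nchildren; i++) {
        char c;
        if (read(ready[0], &c, 1) != 1)
            failed = 1;
    }
    usleep(100 * 1000);     /* Heuristic: let children reach epoll_wait() */

    if (write(data[1], "x", 1) != 1)        /* Make the pipe readable */
        failed = 1;

    for (int i = 0; i < nchildren; i++) {
        int status;
        if (wait(&status) == -1 || !WIFEXITED(status) ||
                WEXITSTATUS(status) == 2)
            failed = 1;
        else if (WEXITSTATUS(status) == 0)
            woken++;
    }
    close(data[0]);
    close(data[1]);
    close(ready[0]);
    return failed ? -1 : woken;
}
```

Calling count_exclusive_wakeups(3, 1000) from a small main() and
printing the result is enough to experiment with; the children never
read the byte, so any child that was not yet blocked when the write
happened will also see the pipe as readable.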

In other words, "target file" appears to mean open file description
(aka open file table entry), not inode.
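
To make the distinction concrete: two descriptors refer to the same
open file description if they share a file offset (as after dup() or
fork()), whereas two separate open() calls on the same path yield two
open file descriptions for one inode. The following is only a sketch of
that idea; the helper name is invented for illustration, and the check
works only on seekable files, so it cannot be applied to a FIFO itself.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd_a and fd_b appear to share a
   file offset (same open file description), 0 if they do not, and -1
   on error (for example, if the descriptors are not seekable). */
int
shares_open_file_description(int fd_a, int fd_b)
{
    off_t a0 = lseek(fd_a, 0, SEEK_CUR);
    off_t b0 = lseek(fd_b, 0, SEEK_CUR);
    int shared;

    if (a0 == -1 || b0 == -1)
        return -1;
    if (a0 != b0)
        return 0;           /* Different offsets cannot be shared */

    /* Move fd_a and see whether fd_b's offset follows */
    if (lseek(fd_a, a0 + 1, SEEK_SET) == -1)
        return -1;
    shared = (lseek(fd_b, 0, SEEK_CUR) == a0 + 1);

    lseek(fd_a, a0, SEEK_SET);      /* Restore the original offset */
    return shared;
}
```

With a temporary file created by mkstemp(), a dup() of the descriptor
reports 1, while a second open() of the same path reports 0, even
though both refer to the same inode.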

This is actually what I suspected might be the case, but now I am
puzzled. Given what I've discovered and what you suggest are the
semantics, is the implementation correct? (I suspect that it is,
but it is at odds with your statement above.) My test programs are
inline below.

Cheers,

Michael



/* t_EPOLLEXCLUSIVE_multipen.c

   Licensed under GNU GPLv2 or later.
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define usageErr(msg, progName) \
                        do { fprintf(stderr, "Usage: "); \
                             fprintf(stderr, msg, progName); \
                             exit(EXIT_FAILURE); } while (0)

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 28)
#endif

int
main(int argc, char *argv[])
{
    int fd, epfd, nready;
    struct epoll_event ev, rev;

    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s n", argv[0]);

    epfd = epoll_create(2);
    if (epfd == -1)
        errExit("epoll_create");

    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        errExit("open");
    printf("Opened %s\n", argv[1]);

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    if (epoll_ctl(
