Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-17 Thread Eric W. Biederman
Christian Brauner  writes:

> On Mon, Aug 17, 2020 at 10:48:01AM -0500, Eric W. Biederman wrote:
>> 
>> Creating names in the kernel for namespaces is very difficult and
>> problematic.  I have not seen anything that looks like  all of the
>> problems have been solved with restoring these new names.
>> 
>> When your filter for your list of namespaces is user namespace creating
>> a new directory in proc is highly questionable.
>> 
>> As everyone uses proc placing this functionality in proc also amplifies
>> the problem of creating names.
>> 
>> 
>> Rather than proc having a way to mount a namespace filesystem filter by
>> the user namespace of the mounter likely to have many many fewer
>> problems.  Especially as we are limiting/not allow new non-process
>> things and ideally finding a way to remove the non-process things.
>> 
>> 
>> Kirill you have a good point that taking the case where a pid namespace
>> does not exist in a user namespace is likely quite unrealistic.
>> 
>> Kirill mentioned upthread that the list of namespaces are the list that
>> can appear in a container.  Except by discipline in creating containers
>> it is not possible to know which namespaces may appear in attached to a
>> process.  It is possible to be very creative with setns, and violate any
>> constraint you may have.  Which means your filtered list of namespaces
>> may not contain all of the namespaces used by a set of processes.  This
>
> Indeed. We use setns() quite creatively when intercepting syscalls and
> when attaching to a container.
>
>> further argues that attaching the list of namespaces to proc does not
>> make sense.
>> 
>> Andrei has a good point that placing the names in a hierarchy by
>> user namespace has the potential to create more freedom when
>> assigning names to namespaces, as it means the names for namespaces
>> do not need to be globally unique, and while still allowing the names
>> to stay the same.
>> 
>> 
>> To recap the possibilities for names for namespaces that I have seen
>> mentioned in this thread are:
>>   - Names per mount
>>   - Names per user namespace
>> 
>> I personally suspect that names per mount are likely to be so flexibly
>> they are confusing, while names per user namespace are likely to be
>> rigid, possibly too rigid to use.
>> 
>> It all depends upon how everything is used.  I have yet to see a
>> complete story of how these names will be generated and used.  So I can
>> not really judge.
>
> So I haven't fully understood either what the motivation for this
> patchset is.
> I can just speak to the use-case I had when I started prototyping
> something similar: We needed a way to get a view on all namespaces
> that exist on the system because we wanted a way to do namespace
> debugging on a live system. This interface could've easily lived in
> debugfs. The main point was that it should contain all namespaces.
> Note, that it wasn't supposed to be a hierarchical format it was only
> mean to list all namespaces and accessible to real root.
> The interface here is way more flexible/complex and I haven't yet
> figured out what exactly it is supposed to be used for.
>
>> 
>> 
>> Let me add another take on this idea that might give this work a path
>> forward. If I were solving this I would explore giving nsfs directories
>> per user namespace, and a way to mount it that exposed the directory of
>> the mounters current user namespace (something like btrfs snapshots).
>> 
>> Hmm.  For the user namespace directory I think I would give it a file
>> "ns" that can be opened to get a file handle on the user namespace.
>> Plus a set of subdirectories "cgroup", "ipc", "mnt", "net", "pid",
>> "user", "uts") for each type of namespace.  In each directory I think
>> I would just have a 64bit counter and each new entry I would assign the
>> next number from that counter.
>> 
>> The restore could either have the ability to rename files or simply the
>> ability to bump the counter (like we do with pids) so the names of the
>> namespaces can be restored.
>> 
>> That winds up making a user namespace the namespace of namespaces, so
>> I am not 100% about the idea. 
>
> I think you're right that we need to understand better what the use-case
> is. If I understand your suggestion correctly it wouldn't allow to show
> nested user namespaces if the nsfs mount is per-user namespace.

So what I was thinking is that we have the user namespace directories
and that the mount code would perform a bind mount such that the
directory that matches the mounter's user namespace is the root
directory.

> Let me throw in a crazy idea: couldn't we just make the ioctl_ns() walk
> a namespace hierarchy? For example, you could pass in a user namespace
> fd and then you'd get back a struct with handles for fds for the
> namespaces owned by that user namespace and then you could use
> NS_GET_USERNS/NS_GET_PARENT to walk upwards from the user namespace fd
> passed in initially and so on? Or something similar/simpler. This would

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-17 Thread Eric W. Biederman


Creating names in the kernel for namespaces is very difficult and
problematic.  I have not seen anything suggesting that all of the
problems with restoring these new names have been solved.

When the filter for your list of namespaces is the user namespace,
creating a new directory in proc is highly questionable.

As everyone uses proc, placing this functionality in proc also amplifies
the problem of creating names.


Rather than extending proc, having a way to mount a namespace filesystem
filtered by the user namespace of the mounter is likely to have many,
many fewer problems.  Especially as we are limiting/not allowing new
non-process things in proc and ideally finding a way to remove the
non-process things.


Kirill, you have a good point that the case where a pid namespace
does not exist in a user namespace is likely quite unrealistic.

Kirill mentioned upthread that the list of namespaces is the list of
those that can appear in a container.  Except by discipline in creating
containers, it is not possible to know which namespaces may appear
attached to a process.  It is possible to be very creative with setns
and violate any constraint you may have, which means your filtered list
of namespaces may not contain all of the namespaces used by a set of
processes.  This further argues that attaching the list of namespaces
to proc does not make sense.

Andrei has a good point that placing the names in a hierarchy by
user namespace has the potential to create more freedom when
assigning names to namespaces, as it means the names for namespaces
do not need to be globally unique, while still allowing the names
to stay the same.


To recap, the possibilities for naming namespaces that I have seen
mentioned in this thread are:
  - Names per mount
  - Names per user namespace

I personally suspect that names per mount are likely to be so flexible
that they are confusing, while names per user namespace are likely to be
rigid, possibly too rigid to use.

It all depends upon how everything is used.  I have yet to see a
complete story of how these names will be generated and used.  So I can
not really judge.


Let me add another take on this idea that might give this work a path
forward. If I were solving this I would explore giving nsfs directories
per user namespace, and a way to mount it that exposed the directory of
the mounter's current user namespace (something like btrfs snapshots).

Hmm.  For the user namespace directory I think I would give it a file
"ns" that can be opened to get a file handle on the user namespace,
plus a set of subdirectories ("cgroup", "ipc", "mnt", "net", "pid",
"user", "uts"), one for each type of namespace.  In each directory I
think I would just have a 64-bit counter, and I would assign each new
entry the next number from that counter.

The restore could either have the ability to rename files or simply the
ability to bump the counter (like we do with pids) so the names of the
namespaces can be restored.

That winds up making a user namespace the namespace of namespaces, so
I am not 100% sure about the idea.

Eric
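
The per-user-namespace layout described in the message above can be
modelled in a few lines. This is only a sketch, not kernel code and not
anything from the patchset: the class names, the starting counter value
of 1 and the bump() helper are all assumptions made for illustration.
It shows one counter per type directory, names taken from that counter,
and a restore path that moves the counter forward the way the message
suggests ("like we do with pids").

```python
from dataclasses import dataclass, field

# Subdirectory names as listed in the message above.
NS_TYPES = ("cgroup", "ipc", "mnt", "net", "pid", "user", "uts")
COUNTER_MAX = 2**64 - 1  # the message proposes a 64-bit counter


@dataclass
class TypeDir:
    """Hypothetical model of one subdirectory, e.g. "pid"."""
    next_id: int = 1  # assumption: numbering starts at 1
    entries: dict = field(default_factory=dict)  # name -> namespace object

    def add(self, ns):
        """Name a new namespace with the next number from the counter."""
        if self.next_id > COUNTER_MAX:
            raise OverflowError("64-bit counter exhausted")
        name = str(self.next_id)
        self.next_id += 1
        self.entries[name] = ns
        return name

    def bump(self, next_id):
        """Restore helper: move the counter forward, never backward."""
        if next_id < self.next_id:
            raise ValueError("counter can only move forward")
        self.next_id = next_id


@dataclass
class UserNsDir:
    """Hypothetical model of one user namespace directory."""
    subdirs: dict = field(
        default_factory=lambda: {t: TypeDir() for t in NS_TYPES})
```

Because each type directory has its own counter, a restore only has to
bump the counter to the saved value before recreating a namespace to get
the same name back. Names are unique per user namespace directory, not
globally.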




Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-17 Thread Christian Brauner
On Mon, Aug 17, 2020 at 10:48:01AM -0500, Eric W. Biederman wrote:
> 
> Creating names in the kernel for namespaces is very difficult and
> problematic.  I have not seen anything that looks like  all of the
> problems have been solved with restoring these new names.
> 
> When your filter for your list of namespaces is user namespace creating
> a new directory in proc is highly questionable.
> 
> As everyone uses proc placing this functionality in proc also amplifies
> the problem of creating names.
> 
> 
> Rather than proc having a way to mount a namespace filesystem filter by
> the user namespace of the mounter likely to have many many fewer
> problems.  Especially as we are limiting/not allow new non-process
> things and ideally finding a way to remove the non-process things.
> 
> 
> Kirill you have a good point that taking the case where a pid namespace
> does not exist in a user namespace is likely quite unrealistic.
> 
> Kirill mentioned upthread that the list of namespaces are the list that
> can appear in a container.  Except by discipline in creating containers
> it is not possible to know which namespaces may appear in attached to a
> process.  It is possible to be very creative with setns, and violate any
> constraint you may have.  Which means your filtered list of namespaces
> may not contain all of the namespaces used by a set of processes.  This

Indeed. We use setns() quite creatively when intercepting syscalls and
when attaching to a container.

> further argues that attaching the list of namespaces to proc does not
> make sense.
> 
> Andrei has a good point that placing the names in a hierarchy by
> user namespace has the potential to create more freedom when
> assigning names to namespaces, as it means the names for namespaces
> do not need to be globally unique, and while still allowing the names
> to stay the same.
> 
> 
> To recap the possibilities for names for namespaces that I have seen
> mentioned in this thread are:
>   - Names per mount
>   - Names per user namespace
> 
> I personally suspect that names per mount are likely to be so flexibly
> they are confusing, while names per user namespace are likely to be
> rigid, possibly too rigid to use.
> 
> It all depends upon how everything is used.  I have yet to see a
> complete story of how these names will be generated and used.  So I can
> not really judge.

So I haven't fully understood what the motivation for this patchset is
either.
I can just speak to the use-case I had when I started prototyping
something similar: We needed a way to get a view on all namespaces
that exist on the system because we wanted a way to do namespace
debugging on a live system. This interface could've easily lived in
debugfs. The main point was that it should contain all namespaces.
Note that it wasn't supposed to be a hierarchical format; it was only
meant to list all namespaces and be accessible to real root.
The interface here is way more flexible/complex and I haven't yet
figured out what exactly it is supposed to be used for.

> 
> 
> Let me add another take on this idea that might give this work a path
> forward. If I were solving this I would explore giving nsfs directories
> per user namespace, and a way to mount it that exposed the directory of
> the mounters current user namespace (something like btrfs snapshots).
> 
> Hmm.  For the user namespace directory I think I would give it a file
> "ns" that can be opened to get a file handle on the user namespace.
> Plus a set of subdirectories "cgroup", "ipc", "mnt", "net", "pid",
> "user", "uts") for each type of namespace.  In each directory I think
> I would just have a 64bit counter and each new entry I would assign the
> next number from that counter.
> 
> The restore could either have the ability to rename files or simply the
> ability to bump the counter (like we do with pids) so the names of the
> namespaces can be restored.
> 
> That winds up making a user namespace the namespace of namespaces, so
> I am not 100% about the idea. 

I think you're right that we need to understand better what the use-case
is. If I understand your suggestion correctly it wouldn't allow showing
nested user namespaces if the nsfs mount is per-user namespace.

Let me throw in a crazy idea: couldn't we just make the ioctl_ns() walk
a namespace hierarchy? For example, you could pass in a user namespace
fd and then you'd get back a struct with handles for fds for the
namespaces owned by that user namespace and then you could use
NS_GET_USERNS/NS_GET_PARENT to walk upwards from the user namespace fd
passed in initially and so on? Or something similar/simpler. This would
also decouple this from procfs somewhat.

Christian
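
The upward half of the walk described above is already possible with
the existing ioctl_ns(2) operations. The sketch below uses Python's
fcntl.ioctl with the NS_GET_USERNS and NS_GET_PARENT request values from
<linux/nsfs.h>; the helper names owning_userns() and walk_up() are made
up for this example. The "struct with handles for the owned namespaces"
part of the idea has no existing ioctl, so it is not shown.

```python
import fcntl
import os

# Values of _IO(0xb7, 0x1) and _IO(0xb7, 0x2) from <linux/nsfs.h>.
NS_GET_USERNS = 0xB701
NS_GET_PARENT = 0xB702


def owning_userns(ns_fd):
    """Return a new fd for the user namespace that owns ns_fd."""
    return fcntl.ioctl(ns_fd, NS_GET_USERNS)


def walk_up(userns_fd):
    """Yield the inode number of userns_fd and of each visible ancestor."""
    fd = os.dup(userns_fd)  # work on a copy so the caller keeps its fd
    try:
        while True:
            yield os.fstat(fd).st_ino
            try:
                parent = fcntl.ioctl(fd, NS_GET_PARENT)
            except OSError:
                # EPERM: the parent is outside the caller's namespace
                # scope, or fd already refers to the initial namespace.
                return
            os.close(fd)
            fd = parent
    finally:
        os.close(fd)
```

On a Linux system one would open /proc/self/ns/pid, pass its fd to
owning_userns(), and feed the result to walk_up(). The walk stops at the
first ancestor the caller is not allowed to see.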


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-17 Thread Kirill Tkhai
On 14.08.2020 22:21, Andrei Vagin wrote:
> On Fri, Aug 14, 2020 at 06:11:58PM +0300, Kirill Tkhai wrote:
>> On 14.08.2020 04:16, Andrei Vagin wrote:
>>> On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
 On 12.08.2020 20:53, Andrei Vagin wrote:
> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
>> On 10.08.2020 20:34, Andrei Vagin wrote:
>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
 On 06.08.2020 11:05, Andrei Vagin wrote:
> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
 On 30.07.2020 17:34, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> Currently, there is no a way to list or iterate all or subset of 
>> namespaces
>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
>> directories,
>> but some also may be as open files, which are not attached to a 
>> process.
>> When a namespace open fd is sent over unix socket and then 
>> closed, it is
>> impossible to know whether the namespace exists or not.
>>
>> Also, even if namespace is exposed as attached to a process or 
>> as open file,
>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not 
>> fast, because
>> this multiplies at tasks and fds number.
>
> I am very dubious about this.
>
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.

 restart/restore :)

> You do have some filtering and the filtering is not based on 
> current.
> Which is good.
>
> A view that is relative to a user namespace might be ok.It 
> almost
> certainly does better as it's own little filesystem than as an 
> extension
> to proc though.
>
> The big thing we want to ensure is that if you migrate you can 
> restore
> everything.  I don't see how you will be able to restore these 
> files
> after migration.  Anything like this without having a complete
> checkpoint/restore story is a non-starter.

 There is no difference between files in /proc/namespaces/ 
 directory and /proc/[pid]/ns/.

 CRIU can restore open files in /proc/[pid]/ns, the same will be 
 with /proc/namespaces/ files.
 As a person who worked deeply for pid_ns and user_ns support in 
 CRIU, I don't see any
 problem here.
>>>
>>> An obvious diffference is that you are adding the inode to the 
>>> inode to
>>> the file name.  Which means that now you really do have to preserve 
>>> the
>>> inode numbers during process migration.
>>>
>>> Which means now we have to do all of the work to make inode number
>>> restoration possible.  Which means now we need to have multiple
>>> instances of nsfs so that we can restore inode numbers.
>>>
>>> I think this is still possible but we have been delaying figuring 
>>> out
>>> how to restore inode numbers long enough that may be actual 
>>> technical
>>> problems making it happen.
>>
>> Yeah, this matters. But it looks like here is not a dead end. We 
>> just need
>> change the names the namespaces are exported to particular fs and to 
>> support
>> rename().
>>
>> Before introduction a principally new filesystem type for this, can't
>> this be solved in current /proc?
>
> do you mean to introduce names for namespaces which users will be able
> to change? By default, this can be uuid.

 Yes, I mean this.

 Currently I won't give a final answer about UUID, but I planned to 
 show some
 default names, which based on namespace type and inode num. Completely 
 custom
 names for any /proc by default will waste too much memory.

 So, I think the good way will be:

 1)Introduce a function, which returns a hash/uuid based on ino, ns 
 type and some static
 random seed, which is generated on boot;

 2)Use the hash/uuid as default names in newly create /proc/namespaces: 
 pid-{hash/uuid(ino, "pid")}

 3)Allow rename, and allocate space only for renamed names.

 Maybe 2 and 3 will be implemented as shrinkab

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-14 Thread Andrei Vagin
On Fri, Aug 14, 2020 at 06:11:58PM +0300, Kirill Tkhai wrote:
> On 14.08.2020 04:16, Andrei Vagin wrote:
> > On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
> >> On 12.08.2020 20:53, Andrei Vagin wrote:
> >>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
>  On 10.08.2020 20:34, Andrei Vagin wrote:
> > On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>  On 31.07.2020 01:13, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> >
> >> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>> Kirill Tkhai  writes:
> >>>
>  Currently, there is no a way to list or iterate all or subset of 
>  namespaces
>  in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
>  directories,
>  but some also may be as open files, which are not attached to a 
>  process.
>  When a namespace open fd is sent over unix socket and then 
>  closed, it is
>  impossible to know whether the namespace exists or not.
> 
>  Also, even if namespace is exposed as attached to a process or 
>  as open file,
>  iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not 
>  fast, because
>  this multiplies at tasks and fds number.
> >>>
> >>> I am very dubious about this.
> >>>
> >>> I have been avoiding exactly this kind of interface because it can
> >>> create rather fundamental problems with checkpoint restart.
> >>
> >> restart/restore :)
> >>
> >>> You do have some filtering and the filtering is not based on 
> >>> current.
> >>> Which is good.
> >>>
> >>> A view that is relative to a user namespace might be ok.It 
> >>> almost
> >>> certainly does better as it's own little filesystem than as an 
> >>> extension
> >>> to proc though.
> >>>
> >>> The big thing we want to ensure is that if you migrate you can 
> >>> restore
> >>> everything.  I don't see how you will be able to restore these 
> >>> files
> >>> after migration.  Anything like this without having a complete
> >>> checkpoint/restore story is a non-starter.
> >>
> >> There is no difference between files in /proc/namespaces/ 
> >> directory and /proc/[pid]/ns/.
> >>
> >> CRIU can restore open files in /proc/[pid]/ns, the same will be 
> >> with /proc/namespaces/ files.
> >> As a person who worked deeply for pid_ns and user_ns support in 
> >> CRIU, I don't see any
> >> problem here.
> >
> > An obvious diffference is that you are adding the inode to the 
> > inode to
> > the file name.  Which means that now you really do have to preserve 
> > the
> > inode numbers during process migration.
> >
> > Which means now we have to do all of the work to make inode number
> > restoration possible.  Which means now we need to have multiple
> > instances of nsfs so that we can restore inode numbers.
> >
> > I think this is still possible but we have been delaying figuring 
> > out
> > how to restore inode numbers long enough that may be actual 
> > technical
> > problems making it happen.
> 
>  Yeah, this matters. But it looks like here is not a dead end. We 
>  just need
>  change the names the namespaces are exported to particular fs and to 
>  support
>  rename().
> 
>  Before introduction a principally new filesystem type for this, can't
>  this be solved in current /proc?
> >>>
> >>> do you mean to introduce names for namespaces which users will be able
> >>> to change? By default, this can be uuid.
> >>
> >> Yes, I mean this.
> >>
> >> Currently I won't give a final answer about UUID, but I planned to 
> >> show some
> >> default names, which based on namespace type and inode num. Completely 
> >> custom
> >> names for any /proc by default will waste too much memory.
> >>
> >> So, I think the good way will be:
> >>
> >> 1)Introduce a function, which returns a hash/uuid based on ino, ns 
> >> type and some static
> >> random seed, which is generated on boot;
> >>
> >> 2)Use the hash/uuid as default names in newly create /proc/namespaces: 
> >> pid-{hash/uuid(ino, "pid")}
> >>
> >> 3)Allow rename, and allocate space only for renamed names.
> >>
> >> Maybe 2 and 3 will be implemented as shrinkable dentries and 
> >> non-shrinkable.
> 

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-14 Thread Kirill Tkhai
On 14.08.2020 04:16, Andrei Vagin wrote:
> On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
>> On 12.08.2020 20:53, Andrei Vagin wrote:
>>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
 On 10.08.2020 20:34, Andrei Vagin wrote:
> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
>> On 06.08.2020 11:05, Andrei Vagin wrote:
>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
 On 31.07.2020 01:13, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
 Currently, there is no a way to list or iterate all or subset of 
 namespaces
 in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
 directories,
 but some also may be as open files, which are not attached to a 
 process.
 When a namespace open fd is sent over unix socket and then closed, 
 it is
 impossible to know whether the namespace exists or not.

 Also, even if namespace is exposed as attached to a process or as 
 open file,
 iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not 
 fast, because
 this multiplies at tasks and fds number.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on 
>>> current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok.It 
>>> almost
>>> certainly does better as it's own little filesystem than as an 
>>> extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can 
>>> restore
>>> everything.  I don't see how you will be able to restore these files
>>> after migration.  Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory 
>> and /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
>> /proc/namespaces/ files.
>> As a person who worked deeply for pid_ns and user_ns support in 
>> CRIU, I don't see any
>> problem here.
>
> An obvious diffference is that you are adding the inode to the inode 
> to
> the file name.  Which means that now you really do have to preserve 
> the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible.  Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
>
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that may be actual technical
> problems making it happen.

 Yeah, this matters. But it looks like here is not a dead end. We just 
 need
 change the names the namespaces are exported to particular fs and to 
 support
 rename().

 Before introduction a principally new filesystem type for this, can't
 this be solved in current /proc?
>>>
>>> do you mean to introduce names for namespaces which users will be able
>>> to change? By default, this can be uuid.
>>
>> Yes, I mean this.
>>
>> Currently I won't give a final answer about UUID, but I planned to show 
>> some
>> default names, which based on namespace type and inode num. Completely 
>> custom
>> names for any /proc by default will waste too much memory.
>>
>> So, I think the good way will be:
>>
>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type
>> and some static random seed, which is generated on boot;
>>
>> 2)Use the hash/uuid as default names in newly create /proc/namespaces:
>> pid-{hash/uuid(ino, "pid")}
>>
>> 3)Allow rename, and allocate space only for renamed names.
>>
>> Maybe 2 and 3 will be implemented as shrinkable dentries and
>> non-shrinkable.
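
The three steps quoted above can be sketched in user space as follows.
Nothing here comes from the patchset: the keyed BLAKE2b hash, the
8-byte digest size, BOOT_SEED and the NamespaceNames class are
assumptions chosen only to illustrate "derive the default name, store
only the renamed ones".

```python
import hashlib
import os

# Stand-in for the random seed generated once at boot (step 1).
BOOT_SEED = os.urandom(16)


def default_name(ino, ns_type, seed=BOOT_SEED):
    """Step 2: derive a default name from ns type, inode number and seed."""
    digest = hashlib.blake2b(
        f"{ns_type}:{ino}".encode(), key=seed, digest_size=8
    ).hexdigest()
    return f"{ns_type}-{digest}"


class NamespaceNames:
    """Step 3: memory is spent only on names that were actually renamed."""

    def __init__(self, seed=BOOT_SEED):
        self.seed = seed
        self.renamed = {}  # (ns_type, ino) -> custom name

    def name_of(self, ino, ns_type):
        """Return the custom name if one was set, else the derived default."""
        custom = self.renamed.get((ns_type, ino))
        if custom is not None:
            return custom
        return default_name(ino, ns_type, self.seed)

    def rename(self, ino, ns_type, new_name):
        """Record a custom name, e.g. when restoring after migration."""
        self.renamed[(ns_type, ino)] = new_name
```

Default names are recomputed on every lookup, so they cost no storage,
but they change whenever the seed changes. A restore on another machine
would therefore have to go through rename() to get the old names back.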
>>
>>> And I have a suggestion about the structure of /proc/namespaces/.
>>>
>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
>>> to group namespaces by their user-namespaces?
>>>
>>> /proc/namespaces/
>>>  user
>>>  mnt-X
>

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-13 Thread Andrei Vagin
On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
> On 12.08.2020 20:53, Andrei Vagin wrote:
> > On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> >> On 10.08.2020 20:34, Andrei Vagin wrote:
> >>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
>  On 06.08.2020 11:05, Andrei Vagin wrote:
> > On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>> Kirill Tkhai  writes:
> >>>
>  On 30.07.2020 17:34, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> >
> >> Currently, there is no a way to list or iterate all or subset of 
> >> namespaces
> >> in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
> >> directories,
> >> but some also may be as open files, which are not attached to a 
> >> process.
> >> When a namespace open fd is sent over unix socket and then closed, 
> >> it is
> >> impossible to know whether the namespace exists or not.
> >>
> >> Also, even if namespace is exposed as attached to a process or as 
> >> open file,
> >> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not 
> >> fast, because
> >> this multiplies at tasks and fds number.
> >
> > I am very dubious about this.
> >
> > I have been avoiding exactly this kind of interface because it can
> > create rather fundamental problems with checkpoint restart.
> 
>  restart/restore :)
> 
> > You do have some filtering and the filtering is not based on 
> > current.
> > Which is good.
> >
> > A view that is relative to a user namespace might be ok.It 
> > almost
> > certainly does better as it's own little filesystem than as an 
> > extension
> > to proc though.
> >
> > The big thing we want to ensure is that if you migrate you can 
> > restore
> > everything.  I don't see how you will be able to restore these files
> > after migration.  Anything like this without having a complete
> > checkpoint/restore story is a non-starter.
> 
>  There is no difference between files in /proc/namespaces/ directory 
>  and /proc/[pid]/ns/.
> 
>  CRIU can restore open files in /proc/[pid]/ns, the same will be with 
>  /proc/namespaces/ files.
>  As a person who worked deeply for pid_ns and user_ns support in 
>  CRIU, I don't see any
>  problem here.
> >>>
> >>> An obvious diffference is that you are adding the inode to the inode 
> >>> to
> >>> the file name.  Which means that now you really do have to preserve 
> >>> the
> >>> inode numbers during process migration.
> >>>
> >>> Which means now we have to do all of the work to make inode number
> >>> restoration possible.  Which means now we need to have multiple
> >>> instances of nsfs so that we can restore inode numbers.
> >>>
> >>> I think this is still possible but we have been delaying figuring out
> >>> how to restore inode numbers long enough that may be actual technical
> >>> problems making it happen.
> >>
> >> Yeah, this matters. But it looks like here is not a dead end. We just 
> >> need
> >> change the names the namespaces are exported to particular fs and to 
> >> support
> >> rename().
> >>
> >> Before introduction a principally new filesystem type for this, can't
> >> this be solved in current /proc?
> >
> > do you mean to introduce names for namespaces which users will be able
> > to change? By default, this can be uuid.
> 
>  Yes, I mean this.
> 
>  Currently I won't give a final answer about UUID, but I planned to show 
>  some
>  default names, which based on namespace type and inode num. Completely 
>  custom
>  names for any /proc by default will waste too much memory.
> 
>  So, I think the good way will be:
> 
>  1)Introduce a function, which returns a hash/uuid based on ino, ns type 
>  and some static
>  random seed, which is generated on boot;
> 
>  2)Use the hash/uuid as default names in newly create /proc/namespaces: 
>  pid-{hash/uuid(ino, "pid")}
> 
>  3)Allow rename, and allocate space only for renamed names.
> 
>  Maybe 2 and 3 will be implemented as shrinkable dentries and 
>  non-shrinkable.
> 
> > And I have a suggestion about the structure of /proc/namespaces/.
> >
> > Each namespace is owned by one of user namespaces. Maybe it makes sense
> > to group namespaces by their user-namespaces?
> >
> > /proc/namespaces/
> >  user
> >  mnt-X
> >  mnt-Y
> >   

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-13 Thread Kirill Tkhai
On 12.08.2020 20:53, Andrei Vagin wrote:
> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
>> On 10.08.2020 20:34, Andrei Vagin wrote:
>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
 On 06.08.2020 11:05, Andrei Vagin wrote:
> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
 On 30.07.2020 17:34, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> Currently, there is no a way to list or iterate all or subset of
>> namespaces in the system. Some namespaces are exposed in
>> /proc/[pid]/ns/ directories, but some also may be as open files,
>> which are not attached to a process. When a namespace open fd is sent
>> over unix socket and then closed, it is impossible to know whether
>> the namespace exists or not.
>>
>> Also, even if namespace is exposed as attached to a process or as
>> open file, iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is
>> not fast, because this multiplies at tasks and fds number.
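
For reference, the /proc/*/ns/* iteration that the patch description
above calls slow looks roughly like this from user space. This is a
sketch: the function names are made up, and it relies only on the
documented "type:[inode]" format of the /proc/[pid]/ns/ symlinks.

```python
import os
import re

# Symlink targets in /proc/[pid]/ns/ look like "pid:[4026531836]".
_LINK_RE = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split "type:[inode]" into (type, inode); None if it does not match."""
    m = _LINK_RE.match(target)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def namespaces_seen_via_proc(proc="/proc"):
    """Collect the distinct (type, inode) pairs reachable via /proc/*/ns/*."""
    seen = set()
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        ns_dir = os.path.join(proc, entry, "ns")
        try:
            links = os.listdir(ns_dir)
        except OSError:
            continue  # process exited or access denied
        for link in links:
            try:
                target = os.readlink(os.path.join(ns_dir, link))
            except OSError:
                continue
            parsed = parse_ns_link(target)
            if parsed is not None:
                seen.add(parsed)
    return seen
```

The cost grows with the number of tasks times the number of namespace
types, and the scan still misses namespaces that are kept alive only by
an open fd, which is the gap the patch description points at.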
>
> I am very dubious about this.
>
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.

 restart/restore :)

> You do have some filtering and the filtering is not based on current.
> Which is good.
>
> A view that is relative to a user namespace might be ok.It almost
> certainly does better as it's own little filesystem than as an 
> extension
> to proc though.
>
> The big thing we want to ensure is that if you migrate you can restore
> everything.  I don't see how you will be able to restore these files
> after migration.  Anything like this without having a complete
> checkpoint/restore story is a non-starter.

 There is no difference between files in /proc/namespaces/ directory 
 and /proc/[pid]/ns/.

 CRIU can restore open files in /proc/[pid]/ns, the same will be with 
 /proc/namespaces/ files.
 As a person who worked deeply for pid_ns and user_ns support in CRIU, 
 I don't see any
 problem here.
>>>
>>> An obvious diffference is that you are adding the inode to the inode to
>>> the file name.  Which means that now you really do have to preserve the
>>> inode numbers during process migration.
>>>
>>> Which means now we have to do all of the work to make inode number
>>> restoration possible.  Which means now we need to have multiple
>>> instances of nsfs so that we can restore inode numbers.
>>>
>>> I think this is still possible but we have been delaying figuring out
>>> how to restore inode numbers long enough that may be actual technical
>>> problems making it happen.
>>
>> Yeah, this matters. But it looks like here is not a dead end. We just 
>> need
>> change the names the namespaces are exported to particular fs and to 
>> support
>> rename().
>>
>> Before introduction a principally new filesystem type for this, can't
>> this be solved in current /proc?
>
> do you mean to introduce names for namespaces which users will be able
> to change? By default, this can be uuid.

 Yes, I mean this.

 Currently I won't give a final answer about UUID, but I planned to show 
 some
 default names, which based on namespace type and inode num. Completely 
 custom
 names for any /proc by default will waste too much memory.

 So, I think the good way will be:

 1)Introduce a function, which returns a hash/uuid based on ino, ns type 
 and some static
 random seed, which is generated on boot;

 2)Use the hash/uuid as default names in newly create /proc/namespaces: 
 pid-{hash/uuid(ino, "pid")}

 3)Allow rename, and allocate space only for renamed names.

 Maybe 2 and 3 will be implemented as shrinkable dentries and 
 non-shrinkable.

> And I have a suggestion about the structure of /proc/namespaces/.
>
> Each namespace is owned by one of user namespaces. Maybe it makes sense
> to group namespaces by their user-namespaces?
>
> /proc/namespaces/
>  user
>  mnt-X
>  mnt-Y
>  pid-X
>  uts-Z
>  user-X/
> user
> mnt-A
> mnt-B
> user-C
> user-C/
>user
>  user-Y/
>

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-12 Thread Andrei Vagin
On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> On 10.08.2020 20:34, Andrei Vagin wrote:
> > On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>  On 31.07.2020 01:13, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> >
> >> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>> Kirill Tkhai  writes:
> >>>
>  Currently, there is no a way to list or iterate all or subset of 
>  namespaces
>  in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
>  directories,
>  but some also may be as open files, which are not attached to a 
>  process.
>  When a namespace open fd is sent over unix socket and then closed, 
>  it is
>  impossible to know whether the namespace exists or not.
> 
>  Also, even if namespace is exposed as attached to a process or as 
>  open file,
>  iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, 
>  because
>  this multiplies at tasks and fds number.
> >>>
> >>> I am very dubious about this.
> >>>
> >>> I have been avoiding exactly this kind of interface because it can
> >>> create rather fundamental problems with checkpoint restart.
> >>
> >> restart/restore :)
> >>
> >>> You do have some filtering and the filtering is not based on current.
> >>> Which is good.
> >>>
> >>> A view that is relative to a user namespace might be ok.It almost
> >>> certainly does better as it's own little filesystem than as an 
> >>> extension
> >>> to proc though.
> >>>
> >>> The big thing we want to ensure is that if you migrate you can restore
> >>> everything.  I don't see how you will be able to restore these files
> >>> after migration.  Anything like this without having a complete
> >>> checkpoint/restore story is a non-starter.
> >>
> >> There is no difference between files in /proc/namespaces/ directory 
> >> and /proc/[pid]/ns/.
> >>
> >> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
> >> /proc/namespaces/ files.
> >> As a person who worked deeply for pid_ns and user_ns support in CRIU, 
> >> I don't see any
> >> problem here.
> >
> > An obvious diffference is that you are adding the inode to the inode to
> > the file name.  Which means that now you really do have to preserve the
> > inode numbers during process migration.
> >
> > Which means now we have to do all of the work to make inode number
> > restoration possible.  Which means now we need to have multiple
> > instances of nsfs so that we can restore inode numbers.
> >
> > I think this is still possible but we have been delaying figuring out
> > how to restore inode numbers long enough that may be actual technical
> > problems making it happen.
> 
>  Yeah, this matters. But it looks like here is not a dead end. We just 
>  need
>  change the names the namespaces are exported to particular fs and to 
>  support
>  rename().
> 
>  Before introduction a principally new filesystem type for this, can't
>  this be solved in current /proc?
> >>>
> >>> do you mean to introduce names for namespaces which users will be able
> >>> to change? By default, this can be uuid.
> >>
> >> Yes, I mean this.
> >>
> >> Currently I won't give a final answer about UUID, but I planned to show 
> >> some
> >> default names, which based on namespace type and inode num. Completely 
> >> custom
> >> names for any /proc by default will waste too much memory.
> >>
> >> So, I think the good way will be:
> >>
> >> 1)Introduce a function, which returns a hash/uuid based on ino, ns type 
> >> and some static
> >> random seed, which is generated on boot;
> >>
> >> 2)Use the hash/uuid as default names in newly create /proc/namespaces: 
> >> pid-{hash/uuid(ino, "pid")}
> >>
> >> 3)Allow rename, and allocate space only for renamed names.
> >>
> >> Maybe 2 and 3 will be implemented as shrinkable dentries and 
> >> non-shrinkable.
> >>
> >>> And I have a suggestion about the structure of /proc/namespaces/.
> >>>
> >>> Each namespace is owned by one of user namespaces. Maybe it makes sense
> >>> to group namespaces by their user-namespaces?
> >>>
> >>> /proc/namespaces/
> >>>  user
> >>>  mnt-X
> >>>  mnt-Y
> >>>  pid-X
> >>>  uts-Z
> >>>  user-X/
> >>> user
> >>> mnt-A
> >>> mnt-B
> >>> user-C
> >>> user-C/
> >>>user
> >>>  user-Y/
> >>> user
> >>
> >> Hm, I don't think th

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-11 Thread Kirill Tkhai
On 10.08.2020 20:34, Andrei Vagin wrote:
> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
>> On 06.08.2020 11:05, Andrei Vagin wrote:
>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
 On 31.07.2020 01:13, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
 Currently, there is no a way to list or iterate all or subset of 
 namespaces
 in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
 directories,
 but some also may be as open files, which are not attached to a 
 process.
 When a namespace open fd is sent over unix socket and then closed, it 
 is
 impossible to know whether the namespace exists or not.

 Also, even if namespace is exposed as attached to a process or as open 
 file,
 iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, 
 because
 this multiplies at tasks and fds number.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok.It almost
>>> certainly does better as it's own little filesystem than as an extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can restore
>>> everything.  I don't see how you will be able to restore these files
>>> after migration.  Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory and 
>> /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
>> /proc/namespaces/ files.
>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I 
>> don't see any
>> problem here.
>
> An obvious diffference is that you are adding the inode to the inode to
> the file name.  Which means that now you really do have to preserve the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible.  Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
>
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that may be actual technical
> problems making it happen.

 Yeah, this matters. But it looks like here is not a dead end. We just need
 change the names the namespaces are exported to particular fs and to 
 support
 rename().

 Before introduction a principally new filesystem type for this, can't
 this be solved in current /proc?
>>>
>>> do you mean to introduce names for namespaces which users will be able
>>> to change? By default, this can be uuid.
>>
>> Yes, I mean this.
>>
>> Currently I won't give a final answer about UUID, but I planned to show some
>> default names, which based on namespace type and inode num. Completely custom
>> names for any /proc by default will waste too much memory.
>>
>> So, I think the good way will be:
>>
>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and 
>> some static
>> random seed, which is generated on boot;
>>
>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: 
>> pid-{hash/uuid(ino, "pid")}
>>
>> 3)Allow rename, and allocate space only for renamed names.
>>
>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
>>
>>> And I have a suggestion about the structure of /proc/namespaces/.
>>>
>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
>>> to group namespaces by their user-namespaces?
>>>
>>> /proc/namespaces/
>>>  user
>>>  mnt-X
>>>  mnt-Y
>>>  pid-X
>>>  uts-Z
>>>  user-X/
>>> user
>>> mnt-A
>>> mnt-B
>>> user-C
>>> user-C/
>>>user
>>>  user-Y/
>>> user
>>
>> Hm, I don't think that user namespace is a generic key value for everybody.
>> For generic people tasks a user namespace is just a namespace among another
>> namespace types. For me it will look a bit strage to iterate some user 
>> namespaces
>> to build container net topology.
> 
> I can’t agree with you that the user namespace is one of others. It is

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-10 Thread Andrei Vagin
On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> On 06.08.2020 11:05, Andrei Vagin wrote:
> > On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>> Kirill Tkhai  writes:
> >>>
>  On 30.07.2020 17:34, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> >
> >> Currently, there is no a way to list or iterate all or subset of 
> >> namespaces
> >> in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
> >> directories,
> >> but some also may be as open files, which are not attached to a 
> >> process.
> >> When a namespace open fd is sent over unix socket and then closed, it 
> >> is
> >> impossible to know whether the namespace exists or not.
> >>
> >> Also, even if namespace is exposed as attached to a process or as open 
> >> file,
> >> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, 
> >> because
> >> this multiplies at tasks and fds number.
> >
> > I am very dubious about this.
> >
> > I have been avoiding exactly this kind of interface because it can
> > create rather fundamental problems with checkpoint restart.
> 
>  restart/restore :)
> 
> > You do have some filtering and the filtering is not based on current.
> > Which is good.
> >
> > A view that is relative to a user namespace might be ok.It almost
> > certainly does better as it's own little filesystem than as an extension
> > to proc though.
> >
> > The big thing we want to ensure is that if you migrate you can restore
> > everything.  I don't see how you will be able to restore these files
> > after migration.  Anything like this without having a complete
> > checkpoint/restore story is a non-starter.
> 
>  There is no difference between files in /proc/namespaces/ directory and 
>  /proc/[pid]/ns/.
> 
>  CRIU can restore open files in /proc/[pid]/ns, the same will be with 
>  /proc/namespaces/ files.
>  As a person who worked deeply for pid_ns and user_ns support in CRIU, I 
>  don't see any
>  problem here.
> >>>
> >>> An obvious diffference is that you are adding the inode to the inode to
> >>> the file name.  Which means that now you really do have to preserve the
> >>> inode numbers during process migration.
> >>>
> >>> Which means now we have to do all of the work to make inode number
> >>> restoration possible.  Which means now we need to have multiple
> >>> instances of nsfs so that we can restore inode numbers.
> >>>
> >>> I think this is still possible but we have been delaying figuring out
> >>> how to restore inode numbers long enough that may be actual technical
> >>> problems making it happen.
> >>
> >> Yeah, this matters. But it looks like here is not a dead end. We just need
> >> change the names the namespaces are exported to particular fs and to 
> >> support
> >> rename().
> >>
> >> Before introduction a principally new filesystem type for this, can't
> >> this be solved in current /proc?
> > 
> > do you mean to introduce names for namespaces which users will be able
> > to change? By default, this can be uuid.
> 
> Yes, I mean this.
> 
> Currently I won't give a final answer about UUID, but I planned to show some
> default names, which based on namespace type and inode num. Completely custom
> names for any /proc by default will waste too much memory.
> 
> So, I think the good way will be:
> 
> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and 
> some static
> random seed, which is generated on boot;
> 
> 2)Use the hash/uuid as default names in newly create /proc/namespaces: 
> pid-{hash/uuid(ino, "pid")}
> 
> 3)Allow rename, and allocate space only for renamed names.
> 
> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
> 
> > And I have a suggestion about the structure of /proc/namespaces/.
> > 
> > Each namespace is owned by one of user namespaces. Maybe it makes sense
> > to group namespaces by their user-namespaces?
> > 
> > /proc/namespaces/
> >  user
> >  mnt-X
> >  mnt-Y
> >  pid-X
> >  uts-Z
> >  user-X/
> > user
> > mnt-A
> > mnt-B
> > user-C
> > user-C/
> >user
> >  user-Y/
> > user
> 
> Hm, I don't think that user namespace is a generic key value for everybody.
> For generic people tasks a user namespace is just a namespace among another
> namespace types. For me it will look a bit strage to iterate some user 
> namespaces
> to build container net topology.

I can’t agree with you that the user namespace is just one among others. It is
the namespace for namespaces. It sets security boundaries in 

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-07 Thread Kirill Tkhai
On 06.08.2020 11:05, Andrei Vagin wrote:
> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
 On 30.07.2020 17:34, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> Currently, there is no a way to list or iterate all or subset of 
>> namespaces
>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
>> directories,
>> but some also may be as open files, which are not attached to a process.
>> When a namespace open fd is sent over unix socket and then closed, it is
>> impossible to know whether the namespace exists or not.
>>
>> Also, even if namespace is exposed as attached to a process or as open 
>> file,
>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, 
>> because
>> this multiplies at tasks and fds number.
>
> I am very dubious about this.
>
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.

 restart/restore :)

> You do have some filtering and the filtering is not based on current.
> Which is good.
>
> A view that is relative to a user namespace might be ok.It almost
> certainly does better as it's own little filesystem than as an extension
> to proc though.
>
> The big thing we want to ensure is that if you migrate you can restore
> everything.  I don't see how you will be able to restore these files
> after migration.  Anything like this without having a complete
> checkpoint/restore story is a non-starter.

 There is no difference between files in /proc/namespaces/ directory and 
 /proc/[pid]/ns/.

 CRIU can restore open files in /proc/[pid]/ns, the same will be with 
 /proc/namespaces/ files.
 As a person who worked deeply for pid_ns and user_ns support in CRIU, I 
 don't see any
 problem here.
>>>
>>> An obvious diffference is that you are adding the inode to the inode to
>>> the file name.  Which means that now you really do have to preserve the
>>> inode numbers during process migration.
>>>
>>> Which means now we have to do all of the work to make inode number
>>> restoration possible.  Which means now we need to have multiple
>>> instances of nsfs so that we can restore inode numbers.
>>>
>>> I think this is still possible but we have been delaying figuring out
>>> how to restore inode numbers long enough that may be actual technical
>>> problems making it happen.
>>
>> Yeah, this matters. But it looks like here is not a dead end. We just need
>> change the names the namespaces are exported to particular fs and to support
>> rename().
>>
>> Before introduction a principally new filesystem type for this, can't
>> this be solved in current /proc?
> 
> do you mean to introduce names for namespaces which users will be able
> to change? By default, this can be uuid.

Yes, that is what I mean.

I can't give a final answer about UUIDs yet, but I planned to show some
default names, which are based on the namespace type and inode number.
Completely custom names for every /proc by default would waste too much memory.

So, I think a good way would be:

1) Introduce a function that returns a hash/uuid based on the ino, the ns type
   and a static random seed, which is generated at boot;

2) Use the hash/uuid as the default name in a newly created /proc/namespaces:
   pid-{hash/uuid(ino, "pid")}

3) Allow rename, and allocate space only for renamed names.

Maybe 2 and 3 will be implemented as shrinkable and non-shrinkable dentries.
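
A minimal userspace sketch of the default-name idea in Python, only to
illustrate it; the real thing would live in the kernel. The function name
default_ns_name, the choice of SHA-256, the 16-character truncation and the
16-byte seed are all assumptions made for this example, not anything from the
patch set:

```python
import hashlib
import os

# Hypothetical stand-in for the static random seed generated once at boot.
BOOT_SEED = os.urandom(16)


def default_ns_name(ino: int, ns_type: str, seed: bytes = BOOT_SEED) -> str:
    """Return a default name such as 'pid-<hash>' for a namespace.

    The hash mixes the seed, the namespace type and the inode number, so
    the name is stable for one seed but differs under another seed.
    """
    h = hashlib.sha256()
    h.update(seed)
    h.update(ns_type.encode("ascii"))
    h.update(ino.to_bytes(8, "little"))
    # Truncation length is arbitrary here (an assumption of this sketch).
    return f"{ns_type}-{h.hexdigest()[:16]}"


if __name__ == "__main__":
    print(default_ns_name(4026531836, "pid"))
```

Because the seed is part of the hash input, the same (ino, type) pair maps to
a different name under a different seed, which is the property the default
names are meant to have.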

> And I have a suggestion about the structure of /proc/namespaces/.
> 
> Each namespace is owned by one of user namespaces. Maybe it makes sense
> to group namespaces by their user-namespaces?
> 
> /proc/namespaces/
>  user
>  mnt-X
>  mnt-Y
>  pid-X
>  uts-Z
>  user-X/
> user
> mnt-A
> mnt-B
> user-C
> user-C/
>user
>  user-Y/
> user

Hm, I don't think that the user namespace is a generic key for everybody.
For most people's tasks a user namespace is just one namespace type among
others. To me it would look a bit strange to iterate over user namespaces
just to build a container's network topology.

> Do we try to invent cgroupfs for namespaces?

Could you clarify your thought?


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-06 Thread Andrei Vagin
On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> On 31.07.2020 01:13, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> > 
> >> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>> Kirill Tkhai  writes:
> >>>
> >>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>> but some also may be as open files, which are not attached to a process.
> >>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>> impossible to know whether the namespace exists or not.
> >>>>
> >>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>> this multiplies at tasks and fds number.
> >>>
> >>> I am very dubious about this.
> >>>
> >>> I have been avoiding exactly this kind of interface because it can
> >>> create rather fundamental problems with checkpoint restart.
> >>
> >> restart/restore :)
> >>
> >>> You do have some filtering and the filtering is not based on current.
> >>> Which is good.
> >>>
> >>> A view that is relative to a user namespace might be ok.It almost
> >>> certainly does better as it's own little filesystem than as an extension
> >>> to proc though.
> >>>
> >>> The big thing we want to ensure is that if you migrate you can restore
> >>> everything.  I don't see how you will be able to restore these files
> >>> after migration.  Anything like this without having a complete
> >>> checkpoint/restore story is a non-starter.
> >>
> >> There is no difference between files in /proc/namespaces/ directory and 
> >> /proc/[pid]/ns/.
> >>
> >> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
> >> /proc/namespaces/ files.
> >> As a person who worked deeply for pid_ns and user_ns support in CRIU, I 
> >> don't see any
> >> problem here.
> > 
> > An obvious diffference is that you are adding the inode to the inode to
> > the file name.  Which means that now you really do have to preserve the
> > inode numbers during process migration.
> >
> > Which means now we have to do all of the work to make inode number
> > restoration possible.  Which means now we need to have multiple
> > instances of nsfs so that we can restore inode numbers.
> > 
> > I think this is still possible but we have been delaying figuring out
> > how to restore inode numbers long enough that may be actual technical
> > problems making it happen.
> 
> Yeah, this matters. But it looks like here is not a dead end. We just need
> change the names the namespaces are exported to particular fs and to support
> rename().
> 
> Before introduction a principally new filesystem type for this, can't
> this be solved in current /proc?

Do you mean to introduce names for namespaces that users will be able
to change? By default, this can be a uuid.

And I have a suggestion about the structure of /proc/namespaces/.

Each namespace is owned by a user namespace. Maybe it makes sense
to group namespaces by their user namespaces?

/proc/namespaces/
    user
    mnt-X
    mnt-Y
    pid-X
    uts-Z
    user-X/
        user
        mnt-A
        mnt-B
        user-C
        user-C/
            user
    user-Y/
        user
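
To make the layout above concrete, here is a rough Python sketch that builds
such a tree from a flat list of (name, owner) pairs. The input format, the
example names and the helper build_tree are made up for illustration; this is
not code from any patch:

```python
from collections import defaultdict


def build_tree(namespaces, root="user"):
    """Group namespaces by their owning user namespace.

    namespaces: iterable of (name, owner) pairs, where owner is the name
    of the owning user namespace, or None for entries owned by the
    top-level user namespace (hypothetical input format).
    Returns a list of paths relative to /proc/namespaces/.
    """
    children = defaultdict(list)
    for name, owner in namespaces:
        children[owner].append(name)

    lines = []

    def walk(owner, prefix):
        lines.append(prefix + root)  # the owning user namespace itself
        for name in sorted(children[owner]):
            lines.append(prefix + name)
            if name.startswith("user-"):
                # A user namespace is also shown as a directory
                # holding the namespaces it owns.
                walk(name, prefix + name + "/")

    walk(None, "")
    return lines
```

Names only need to be unique inside one directory in this scheme, since the
path through the owning user namespaces disambiguates them.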

Are we trying to invent cgroupfs for namespaces?

Thanks,
Andrei


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-04 Thread Kirill Tkhai
On 04.08.2020 08:43, Andrei Vagin wrote:
> On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote:
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some also may be as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> this multiplies at tasks and fds number.
> 
> Could you describe with more details when you need to iterate
> namespaces?
> 
> There are three ways to hold namespaces.
> 
> * processes
> * bind-mounts
> * file descriptors
> 
> When CRIU dumps a container, it enumirates all processes, collects file
> descriptors and mounts. This means that we will be able to collect all
> namespaces, doesn't it?

1) It's not only for CRIU. No other utility can read the content of another
   task's unix socket like CRIU does. Sometimes we may just want to see all
   mount namespaces to find the mount that holds a reference to a device.
2) In the case of CRIU, a recursive dump (you iterate over a unix socket's
   content, find another namespace, iterate over another unix socket's content,
   then find one more namespace) is less efficient and slower than dumping
   different types sequentially: first namespaces, then fds, etc.
3) It's still impossible to collect all namespaces, as Pasha wrote.


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-04 Thread Pavel Tikhomirov




On 8/4/20 8:43 AM, Andrei Vagin wrote:
> On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote:
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some also may be as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> this multiplies at tasks and fds number.
>
> Could you describe with more details when you need to iterate
> namespaces?
>
> There are three ways to hold namespaces.
>
> * processes
> * bind-mounts
> * file descriptors
>
> When CRIU dumps a container, it enumirates all processes, collects file
> descriptors and mounts. This means that we will be able to collect all
> namespaces, doesn't it?


Yes, we can. But it would be much easier for us to have all namespaces in
one place, wouldn't it?

This patch set also has a non-CRIU use case: it can simplify the view of
namespaces for a normal user. Let's consider some cases:

Let's assume we have an empty (no processes) mount namespace M that is
held by a single open fd, which was put into a unix socket and closed. That
unix socket has a single open fd referring to it, which was in its turn put
into another unix socket, and so on again and again until we reach the unix
socket max depth... How should a normal user find this mount namespace M?

Let's assume that M also has an nsfs bind mount which holds some empty
network namespace N... How should a normal user find N?

Let's also assume that M has overmounted "/":

mount -t tmpfs tmpfs /

Now if you enter M you will see a single tmpfs in mountinfo (because of the
implicit chroot to the overmount on setns), and there is no way to see the
full mountinfo if you do not know the real root dentry... How should a
normal user (or even CRIU) find N?

So my personal opinion is that we need this interface. Maybe it should
be done somewhat differently, but we need it.






--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-03 Thread Andrei Vagin
On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote:
> On 30.07.2020 17:34, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> > 
> >> Currently, there is no a way to list or iterate all or subset of namespaces
> >> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >> but some also may be as open files, which are not attached to a process.
> >> When a namespace open fd is sent over unix socket and then closed, it is
> >> impossible to know whether the namespace exists or not.
> >>
> >> Also, even if namespace is exposed as attached to a process or as open 
> >> file,
> >> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >> this multiplies at tasks and fds number.

Could you describe in more detail when you need to iterate
namespaces?

There are three ways to hold namespaces:

* processes
* bind-mounts
* file descriptors

When CRIU dumps a container, it enumerates all processes, collects file
descriptors and mounts. This means that we will be able to collect all
namespaces, doesn't it?
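
For the first of the three holders (processes), a rough Python sketch of what
such a scan looks like from userspace. The function name and the proc_root
parameter are only for illustration, and the sketch deliberately does not
cover bind-mounts or file descriptors, which would need /proc/[pid]/mountinfo
and /proc/[pid]/fd/ respectively:

```python
import os


def namespaces_held_by_processes(proc_root="/proc"):
    """Map each namespace link target (e.g. 'net:[4026531992]') to the
    set of pids whose /proc/[pid]/ns/ directory refers to it."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a pid directory
        ns_dir = os.path.join(proc_root, entry, "ns")
        try:
            names = os.listdir(ns_dir)
        except OSError:
            continue  # process exited or access denied
        for name in names:
            try:
                target = os.readlink(os.path.join(ns_dir, name))
            except OSError:
                continue  # link vanished or is not readable
            found.setdefault(target, set()).add(int(entry))
    return found
```

The scan touches every ns link of every task, so its cost grows with the
number of processes even when they all share a handful of namespaces.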


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-03 Thread Alexey Dobriyan
On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> On 31.07.2020 01:13, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> > 
> >> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>> Kirill Tkhai  writes:
> >>>
> >>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>> but some also may be as open files, which are not attached to a process.
> >>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>> impossible to know whether the namespace exists or not.
> >>>>
> >>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>> this multiplies at tasks and fds number.
> >>>
> >>> I am very dubious about this.
> >>>
> >>> I have been avoiding exactly this kind of interface because it can
> >>> create rather fundamental problems with checkpoint restart.
> >>
> >> restart/restore :)
> >>
> >>> You do have some filtering and the filtering is not based on current.
> >>> Which is good.
> >>>
> >>> A view that is relative to a user namespace might be ok.It almost
> >>> certainly does better as it's own little filesystem than as an extension
> >>> to proc though.
> >>>
> >>> The big thing we want to ensure is that if you migrate you can restore
> >>> everything.  I don't see how you will be able to restore these files
> >>> after migration.  Anything like this without having a complete
> >>> checkpoint/restore story is a non-starter.
> >>
> >> There is no difference between files in /proc/namespaces/ directory and 
> >> /proc/[pid]/ns/.
> >>
> >> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
> >> /proc/namespaces/ files.
> >> As a person who worked deeply for pid_ns and user_ns support in CRIU, I 
> >> don't see any
> >> problem here.
> > 
> > An obvious diffference is that you are adding the inode to the inode to
> > the file name.  Which means that now you really do have to preserve the
> > inode numbers during process migration.
> >
> > Which means now we have to do all of the work to make inode number
> > restoration possible.  Which means now we need to have multiple
> > instances of nsfs so that we can restore inode numbers.
> > 
> > I think this is still possible but we have been delaying figuring out
> > how to restore inode numbers long enough that may be actual technical
> > problems making it happen.
> 
> Yeah, this matters. But it looks like here is not a dead end. We just need
> change the names the namespaces are exported to particular fs and to support
> rename().
> 
> Before introduction a principally new filesystem type for this, can't
> this be solved in current /proc?
> 
> Alexey, does rename() is prohibited for /proc fs?

Technically it is allowed: add ->rename to the /proc/ns inode.
But nobody does it.


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-03 Thread Kirill Tkhai
On 31.07.2020 01:13, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some also may be as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> this multiplies at tasks and fds number.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok.It almost
>>> certainly does better as it's own little filesystem than as an extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can restore
>>> everything.  I don't see how you will be able to restore these files
>>> after migration.  Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory and 
>> /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
>> /proc/namespaces/ files.
>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I 
>> don't see any
>> problem here.
> 
> An obvious diffference is that you are adding the inode to the inode to
> the file name.  Which means that now you really do have to preserve the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible.  Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
> 
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that may be actual technical
> problems making it happen.

Yeah, this matters. But it does not look like a dead end. We just need to
change the names under which the namespaces are exported to a particular fs,
and to support rename().

Before introducing a fundamentally new filesystem type for this, can't
this be solved in the current /proc?

Alexey, is rename() prohibited for /proc fs?
 
> Now maybe CRIU can handle the names of the files changing during
> migration but you have just increased the level of difficulty for doing
> that.
> 
>> If you have a specific worries about, let's discuss them.
> 
> I was asking and I am asking that it be described in the patch
> description how a container using this feature can be migrated
> from one machine to another.  This code is so close to being problematic
> that we need be very careful we don't fundamentally break CRIU while
> trying to make it's job simpler and easier.
> 
>> CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces 
>> C/R.
>>  
>>> Further by not going through the processes it looks like you are
>>> bypassing the existing permission checks.  Which has the potential
>>> to allow someone to use a namespace who would not be able to otherwise.
>>
>> I agree, and I wrote to Christian, that permissions should be more strict.
>> This just should be formalized. Let's discuss this.
>>
>>> So I think this goes one step too far but I am willing to be persuaded
>>> otherwise.
>>>
> 
> Eric
> 



Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-31 Thread Pavel Tikhomirov




On 7/31/20 1:13 AM, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some also may be as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> this multiplies at tasks and fds number.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok.It almost
>>> certainly does better as it's own little filesystem than as an extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can restore
>>> everything.  I don't see how you will be able to restore these files
>>> after migration.  Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory and
>> /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with
>> /proc/namespaces/ files.
>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I
>> don't see any problem here.
>
> An obvious diffference is that you are adding the inode to the inode to
> the file name.  Which means that now you really do have to preserve the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible.  Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
>
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that may be actual technical
> problems making it happen.
>
> Now maybe CRIU can handle the names of the files changing during
> migration but you have just increased the level of difficulty for doing
> that.


Yes adding /proc/namespaces/:[] files may be a problem 
to CRIU.


First I would like to highlight that open files are not a problem. 
Because open file from /proc/namespaces/* are exactly the same as open 
files from /proc//ns/. So when we c/r an nsfs open file fd 
on dump we readlink the fd and get :[] and on restore 
we recreate each dumped namespace and open an fd to each, so we can 
'dup' it when restoring open file. It will be an fd to topologically 
same namespace though ns_ino would be newly generated.
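
As an illustration of the dump-side step (not CRIU code), here is a minimal
Python sketch that reads an nsfs link and splits it into a namespace type and
an inode number. It assumes only the 'type:[inode]' link-target format visible
in the /proc/namespaces/ listing quoted in this thread; parse_ns_link and
ns_identity are hypothetical helper names chosen for this example.

```python
import os
import re

# Link targets of nsfs files look like "net:[4026531993]".
NS_LINK_RE = re.compile(r"^(?P<type>[a-z_]+):\[(?P<ino>\d+)\]$")


def parse_ns_link(target):
    """Split an nsfs link target into (namespace type, inode number)."""
    match = NS_LINK_RE.match(target)
    if match is None:
        raise ValueError(f"not an nsfs link target: {target!r}")
    return match.group("type"), int(match.group("ino"))


def ns_identity(pid="self", ns="net"):
    """Read /proc/<pid>/ns/<ns> and return its (type, inode) pair.

    The inode number identifies the namespace only on the running system;
    after a restore the recreated namespace gets a new number.
    """
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{ns}"))
```

Comparing two such pairs tells whether two tasks share a namespace on one
host, which is why a name that embeds the inode number stops being stable
once the namespace is recreated elsewhere.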


But the problem I see is with readdir. What if some task is reading the
/proc/namespaces/ directory at the time of dump? After restore, the directory
will contain new names for the namespaces, possibly in a different order,
so if the process continues to readdir it can miss some namespaces or
read some twice.


Maybe instead of multiple files in a /proc/namespaces directory, we can
leave just one file, /proc/namespaces, and when we open it we would return
e.g. a unix socket filled with the fds of all namespaces visible at
this point. This looks like a possible solution to the above problem.


CRIU can restore unix sockets with fds inside, so we should be able to 
dump process using this functionality.
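
For illustration only: the userspace mechanism such a socket would rely on is
ordinary fd passing over a unix socket (SCM_RIGHTS). The Python sketch below
is not the proposed /proc/namespaces interface. It passes a pipe fd as a
stand-in for a namespace fd, uses socket.send_fds() and socket.recv_fds()
(available since Python 3.9), and pass_fds, take_fds and demo are made-up
names.

```python
import os
import socket


def pass_fds(sock, fds):
    # SCM_RIGHTS needs at least one byte of ordinary payload to ride on.
    socket.send_fds(sock, [b"\0"], list(fds))


def take_fds(sock, maxfds=16):
    # The returned descriptors are new fd numbers in the receiver that
    # refer to the same open files as the sender's descriptors.
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, maxfds)
    return list(fds)


def demo():
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stand-in for an open namespace fd
    try:
        pass_fds(left, [read_end])
        (received,) = take_fds(right)
        os.write(write_end, b"ok")
        try:
            # Reading through the received fd proves it is the same pipe.
            return os.read(received, 2)
        finally:
            os.close(received)
    finally:
        for fd in (read_end, write_end):
            os.close(fd)
        left.close()
        right.close()
```

Because the fds sit queued inside the socket, a reader sees one fixed
snapshot rather than a directory whose entries may be renamed or reordered
between readdir calls.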





>> If you have a specific worries about, let's discuss them.
>
> I was asking and I am asking that it be described in the patch
> description how a container using this feature can be migrated
> from one machine to another.  This code is so close to being problematic
> that we need be very careful we don't fundamentally break CRIU while
> trying to make it's job simpler and easier.
>
>> CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces C/R.
>>
>>> Further by not going through the processes it looks like you are
>>> bypassing the existing permission checks.  Which has the potential
>>> to allow someone to use a namespace who would not be able to otherwise.
>>
>> I agree, and I wrote to Christian, that permissions should be more strict.
>> This just should be formalized. Let's discuss this.
>>
>>> So I think this goes one step too far but I am willing to be persuaded
>>> otherwise.
>>>
>
> Eric



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-30 Thread Eric W. Biederman
Kirill Tkhai  writes:

> On 30.07.2020 17:34, Eric W. Biederman wrote:
>> Kirill Tkhai  writes:
>> 
>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>> but some also may be as open files, which are not attached to a process.
>>> When a namespace open fd is sent over unix socket and then closed, it is
>>> impossible to know whether the namespace exists or not.
>>>
>>> Also, even if namespace is exposed as attached to a process or as open file,
>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>> this multiplies at tasks and fds number.
>> 
>> I am very dubious about this.
>> 
>> I have been avoiding exactly this kind of interface because it can
>> create rather fundamental problems with checkpoint restart.
>
> restart/restore :)
>
>> You do have some filtering and the filtering is not based on current.
>> Which is good.
>> 
>> A view that is relative to a user namespace might be ok.It almost
>> certainly does better as it's own little filesystem than as an extension
>> to proc though.
>> 
>> The big thing we want to ensure is that if you migrate you can restore
>> everything.  I don't see how you will be able to restore these files
>> after migration.  Anything like this without having a complete
>> checkpoint/restore story is a non-starter.
>
> There is no difference between files in /proc/namespaces/ directory and 
> /proc/[pid]/ns/.
>
> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
> /proc/namespaces/ files.
> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't 
> see any
> problem here.

An obvious difference is that you are adding the inode number to
the file name.  Which means that now you really do have to preserve the
inode numbers during process migration.

Which means now we have to do all of the work to make inode number
restoration possible.  Which means now we need to have multiple
instances of nsfs so that we can restore inode numbers.

I think this is still possible but we have been delaying figuring out
how to restore inode numbers long enough that there may be actual technical
problems making it happen.

Now maybe CRIU can handle the names of the files changing during
migration but you have just increased the level of difficulty for doing
that.

> If you have a specific worries about, let's discuss them.

I was asking and I am asking that it be described in the patch
description how a container using this feature can be migrated
from one machine to another.  This code is so close to being problematic
that we need to be very careful we don't fundamentally break CRIU while
trying to make its job simpler and easier.

> CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces 
> C/R.
>  
>> Further by not going through the processes it looks like you are
>> bypassing the existing permission checks.  Which has the potential
>> to allow someone to use a namespace who would not be able to otherwise.
>
> I agree, and I wrote to Christian, that permissions should be more strict.
> This just should be formalized. Let's discuss this.
>
>> So I think this goes one step too far but I am willing to be persuaded
>> otherwise.
>> 

Eric



Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-30 Thread Kirill Tkhai
On 30.07.2020 17:34, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> Currently, there is no a way to list or iterate all or subset of namespaces
>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>> but some also may be as open files, which are not attached to a process.
>> When a namespace open fd is sent over unix socket and then closed, it is
>> impossible to know whether the namespace exists or not.
>>
>> Also, even if namespace is exposed as attached to a process or as open file,
>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>> this multiplies at tasks and fds number.
> 
> I am very dubious about this.
> 
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.

restart/restore :)

> You do have some filtering and the filtering is not based on current.
> Which is good.
> 
> A view that is relative to a user namespace might be ok.It almost
> certainly does better as it's own little filesystem than as an extension
> to proc though.
> 
> The big thing we want to ensure is that if you migrate you can restore
> everything.  I don't see how you will be able to restore these files
> after migration.  Anything like this without having a complete
> checkpoint/restore story is a non-starter.

There is no difference between files in the /proc/namespaces/ directory and
those in /proc/[pid]/ns/.

CRIU can restore open files in /proc/[pid]/ns, and the same will hold for
/proc/namespaces/ files.
As a person who worked deeply on pid_ns and user_ns support in CRIU, I don't
see any problem here.

If you have any specific worries, let's discuss them.

CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces C/R.
 
> Further by not going through the processes it looks like you are
> bypassing the existing permission checks.  Which has the potential
> to allow someone to use a namespace who would not be able to otherwise.

I agree, and I wrote to Christian, that permissions should be more strict.
This just should be formalized. Let's discuss this.

> So I think this goes one step too far but I am willing to be persuaded
> otherwise.
> 
> Eric
> 
> 
> 
> 
>> This patchset introduces a new /proc/namespaces/ directory, which exposes
>> subset of permitted namespaces in linear view:
>>
>> # ls /proc/namespaces/ -l
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 
>> 'cgroup:[4026531835]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 
>> 'ipc:[4026531839]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 
>> 'mnt:[4026531840]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 
>> 'mnt:[4026531861]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 
>> 'mnt:[4026532133]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 
>> 'mnt:[4026532134]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 
>> 'mnt:[4026532135]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 
>> 'mnt:[4026532136]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 
>> 'net:[4026531993]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 
>> 'pid:[4026531836]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 
>> 'time:[4026531834]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 
>> 'user:[4026531837]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 
>> 'uts:[4026531838]'
>>
>> Namespace ns is exposed, in case of its user_ns is permitted from /proc's 
>> pid_ns.
>> I.e., /proc is related to pid_ns, so in /proc/namespace we show only a ns, 
>> which is
>>
>>  in_userns(pid_ns->user_ns, ns->user_ns).
>>
>> In case of ns is a user_ns:
>>
>>  in_userns(pid_ns->user_ns, ns).
>>
>> The patchset follows this steps:
>>
>> 1)A generic counter in ns_common is introduced instead of separate
>>   counters for every ns type (net::count, uts_namespace::kref,
>>   user_namespace::count, etc). Patches [1-8];
>> 2)Patch [9] introduces IDR to link and iterate alive namespaces;
>> 3)Patch [10] is refactoring;
>> 4)Patch [11] actually adds /proc/namespace directory and fs methods;
>> 5)Patches [12-23] make every namespace to use the added methods
>>   and to appear in /proc/namespace directory.
>>
>> This may be usefull to write effective debug utils (say, fast build
>> of networks topology) and checkpoint/restore software.
>> ---
>>
>> Kirill Tkhai (23):
>>   ns: Add common refcount into ns_common add use it as counter for net_ns
>>   uts: Use generic ns_common::count
>>   ipc: Use generic ns_common::count
>>   pid: Use generic ns_common::count
>>   user: Use generic ns_common::count
>>   mnt: Use generic ns_common::count
>>   cgroup: Use generic ns_common::count
>>   time: Use generic ns_common::count
>>   ns: Introduce ns_idr to be able to iterate all allocated namespaces in 
>> the system
>>   f

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-30 Thread Christian Brauner
On Thu, Jul 30, 2020 at 09:34:01AM -0500, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
> > Currently, there is no a way to list or iterate all or subset of namespaces
> > in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> > but some also may be as open files, which are not attached to a process.
> > When a namespace open fd is sent over unix socket and then closed, it is
> > impossible to know whether the namespace exists or not.
> >
> > Also, even if namespace is exposed as attached to a process or as open file,
> > iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> > this multiplies at tasks and fds number.
> 
> I am very dubious about this.
> 
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.
> 
> You do have some filtering and the filtering is not based on current.
> Which is good.
> 
> A view that is relative to a user namespace might be ok.It almost
> certainly does better as it's own little filesystem than as an extension
> to proc though.
> 
> The big thing we want to ensure is that if you migrate you can restore
> everything.  I don't see how you will be able to restore these files
> after migration.  Anything like this without having a complete
> checkpoint/restore story is a non-starter.
> 
> Further by not going through the processes it looks like you are
> bypassing the existing permission checks.  Which has the potential
> to allow someone to use a namespace who would not be able to otherwise.
> 
> So I think this goes one step too far but I am willing to be persuaded
> otherwise.

I think we discussed this at Plumbers (last year I want to say?) and
you were against making this a part of procfs already back then, I
think. The last known idea we could agree on was debugfs (shudder). But
a tiny separate fs might work as well.

We really would want those introspection abilities this provides though.
For us it was for debugging when namespaces linger and also to crawl
and inspect namespaces from LXD and various other use-cases. So if we
could make this happen in some form that'd be great.

Thanks!
Christian


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-30 Thread Eric W. Biederman
Kirill Tkhai  writes:

> Currently, there is no a way to list or iterate all or subset of namespaces
> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> but some also may be as open files, which are not attached to a process.
> When a namespace open fd is sent over unix socket and then closed, it is
> impossible to know whether the namespace exists or not.
>
> Also, even if namespace is exposed as attached to a process or as open file,
> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> this multiplies at tasks and fds number.

I am very dubious about this.

I have been avoiding exactly this kind of interface because it can
create rather fundamental problems with checkpoint restart.

You do have some filtering and the filtering is not based on current.
Which is good.

A view that is relative to a user namespace might be ok.  It almost
certainly does better as its own little filesystem than as an extension
to proc though.

The big thing we want to ensure is that if you migrate you can restore
everything.  I don't see how you will be able to restore these files
after migration.  Anything like this without having a complete
checkpoint/restore story is a non-starter.

Further by not going through the processes it looks like you are
bypassing the existing permission checks.  Which has the potential
to allow someone to use a namespace who would not be able to otherwise.

So I think this goes one step too far but I am willing to be persuaded
otherwise.

Eric




> This patchset introduces a new /proc/namespaces/ directory, which exposes
> subset of permitted namespaces in linear view:
>
> # ls /proc/namespaces/ -l
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 
> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 
> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 
> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
>
> Namespace ns is exposed, in case of its user_ns is permitted from /proc's 
> pid_ns.
> I.e., /proc is related to pid_ns, so in /proc/namespace we show only a ns, 
> which is
>
>   in_userns(pid_ns->user_ns, ns->user_ns).
>
> In case of ns is a user_ns:
>
>   in_userns(pid_ns->user_ns, ns).
>
> The patchset follows this steps:
>
> 1)A generic counter in ns_common is introduced instead of separate
>   counters for every ns type (net::count, uts_namespace::kref,
>   user_namespace::count, etc). Patches [1-8];
> 2)Patch [9] introduces IDR to link and iterate alive namespaces;
> 3)Patch [10] is refactoring;
> 4)Patch [11] actually adds /proc/namespace directory and fs methods;
> 5)Patches [12-23] make every namespace to use the added methods
>   and to appear in /proc/namespace directory.
>
> This may be usefull to write effective debug utils (say, fast build
> of networks topology) and checkpoint/restore software.
> ---
>
> Kirill Tkhai (23):
>   ns: Add common refcount into ns_common add use it as counter for net_ns
>   uts: Use generic ns_common::count
>   ipc: Use generic ns_common::count
>   pid: Use generic ns_common::count
>   user: Use generic ns_common::count
>   mnt: Use generic ns_common::count
>   cgroup: Use generic ns_common::count
>   time: Use generic ns_common::count
>   ns: Introduce ns_idr to be able to iterate all allocated namespaces in 
> the system
>   fs: Rename fs/proc/namespaces.c into fs/proc/task_namespaces.c
>   fs: Add /proc/namespaces/ directory
>   user: Free user_ns one RCU grace period after final counter put
>   user: Add user namespaces into ns_idr
>   net: Add net namespaces into ns_idr
>   pid: Eextract child_reaper check from pidns_for_children_get()
>   proc_ns_operations: Add can_get method
>   pid: Add pid namespaces into ns_idr
>   uts: Free uts namespace one RCU grace period after final counter put
>   uts: Add uts namespaces into ns_idr
>   ipc: Add ipc namespaces into ns_idr
>   mnt: Add mount namespaces into ns_idr
>   cgroup: Add cgroup namespaces into ns_idr
>   time: Add time namespaces into ns_idr
>
>
>  fs/mount.h |4 
>  fs

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-30 Thread Christian Brauner
[Cc: linux-api]

On Thu, Jul 30, 2020 at 03:08:53PM +0200, Christian Brauner wrote:
> On Thu, Jul 30, 2020 at 02:59:20PM +0300, Kirill Tkhai wrote:
> > Currently, there is no a way to list or iterate all or subset of namespaces
> > in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> > but some also may be as open files, which are not attached to a process.
> > When a namespace open fd is sent over unix socket and then closed, it is
> > impossible to know whether the namespace exists or not.
> > 
> > Also, even if namespace is exposed as attached to a process or as open file,
> > iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> > this multiplies at tasks and fds number.
> > 
> > This patchset introduces a new /proc/namespaces/ directory, which exposes
> > subset of permitted namespaces in linear view:
> > 
> > # ls /proc/namespaces/ -l
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 
> > 'cgroup:[4026531835]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 
> > 'ipc:[4026531839]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 
> > 'mnt:[4026531840]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 
> > 'mnt:[4026531861]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 
> > 'mnt:[4026532133]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 
> > 'mnt:[4026532134]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 
> > 'mnt:[4026532135]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 
> > 'mnt:[4026532136]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 
> > 'net:[4026531993]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 
> > 'pid:[4026531836]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 
> > 'time:[4026531834]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 
> > 'user:[4026531837]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 
> > 'uts:[4026531838]'
> > 
> > Namespace ns is exposed, in case of its user_ns is permitted from /proc's 
> > pid_ns.
> > I.e., /proc is related to pid_ns, so in /proc/namespace we show only a ns, 
> > which is
> > 
> > in_userns(pid_ns->user_ns, ns->user_ns).
> > 
> > In case of ns is a user_ns:
> > 
> > in_userns(pid_ns->user_ns, ns).
> > 
> > The patchset follows this steps:
> > 
> > 1)A generic counter in ns_common is introduced instead of separate
> >   counters for every ns type (net::count, uts_namespace::kref,
> >   user_namespace::count, etc). Patches [1-8];
> > 2)Patch [9] introduces IDR to link and iterate alive namespaces;
> > 3)Patch [10] is refactoring;
> > 4)Patch [11] actually adds /proc/namespace directory and fs methods;
> > 5)Patches [12-23] make every namespace to use the added methods
> >   and to appear in /proc/namespace directory.
> > 
> > This may be usefull to write effective debug utils (say, fast build
> > of networks topology) and checkpoint/restore software.
> 
> Kirill,
> 
> Thanks for working on this!
> We have a need for this functionality too for namespace introspection.
> I actually had a prototype of this as well but mine was based on debugfs
> but /proc/namespaces seems like a good place.


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-30 Thread Christian Brauner
On Thu, Jul 30, 2020 at 02:59:20PM +0300, Kirill Tkhai wrote:
> Currently, there is no a way to list or iterate all or subset of namespaces
> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> but some also may be as open files, which are not attached to a process.
> When a namespace open fd is sent over unix socket and then closed, it is
> impossible to know whether the namespace exists or not.
> 
> Also, even if namespace is exposed as attached to a process or as open file,
> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> this multiplies at tasks and fds number.
> 
> This patchset introduces a new /proc/namespaces/ directory, which exposes
> subset of permitted namespaces in linear view:
> 
> # ls /proc/namespaces/ -l
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 
> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 
> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 
> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
> 
> Namespace ns is exposed, in case of its user_ns is permitted from /proc's 
> pid_ns.
> I.e., /proc is related to pid_ns, so in /proc/namespace we show only a ns, 
> which is
> 
>   in_userns(pid_ns->user_ns, ns->user_ns).
> 
> In case of ns is a user_ns:
> 
>   in_userns(pid_ns->user_ns, ns).
> 
> The patchset follows this steps:
> 
> 1)A generic counter in ns_common is introduced instead of separate
>   counters for every ns type (net::count, uts_namespace::kref,
>   user_namespace::count, etc). Patches [1-8];
> 2)Patch [9] introduces IDR to link and iterate alive namespaces;
> 3)Patch [10] is refactoring;
> 4)Patch [11] actually adds /proc/namespace directory and fs methods;
> 5)Patches [12-23] make every namespace to use the added methods
>   and to appear in /proc/namespace directory.
> 
> This may be usefull to write effective debug utils (say, fast build
> of networks topology) and checkpoint/restore software.

Kirill,

Thanks for working on this!
We have a need for this functionality too for namespace introspection.
I actually had a prototype of this as well but mine was based on debugfs
but /proc/namespaces seems like a good place.

Christian


[PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-07-30 Thread Kirill Tkhai
Currently, there is no way to list or iterate over all, or a subset, of the
namespaces in the system. Some namespaces are exposed in /proc/[pid]/ns/
directories, but some may also exist only as open files, which are not
attached to a process. When a namespace open fd is sent over a unix socket
and then closed, it is impossible to know whether the namespace exists or not.

Also, even if a namespace is exposed as attached to a process or as an open
file, iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast,
because the cost is multiplied by the number of tasks and fds.
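
For reference, the per-task walk that userspace has to do today can be
sketched as below. This is an illustration rather than part of the patchset;
collect_namespaces is a hypothetical helper name, and the proc_root parameter
exists only so the walk can be pointed at a test directory.

```python
import os


def collect_namespaces(proc_root="/proc"):
    """Return the set of namespace link targets reachable via <pid>/ns/*.

    One readlink per task per namespace type is needed, so the cost grows
    with the number of tasks. Namespaces held only by an open fd or a bind
    mount are not found by this walk at all.
    """
    seen = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a task directory
        ns_dir = os.path.join(proc_root, entry, "ns")
        try:
            names = os.listdir(ns_dir)
        except OSError:
            continue  # task exited, or access was denied
        for name in names:
            try:
                seen.add(os.readlink(os.path.join(ns_dir, name)))
            except OSError:
                continue
    return seen
```

The set deduplicates the result, but the walk still touches every task even
when all of them share the same handful of namespaces.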

This patchset introduces a new /proc/namespaces/ directory, which exposes
the subset of permitted namespaces in a linear view:

# ls /proc/namespaces/ -l
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
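
A tool consuming such a listing would mostly need to split each entry name
into its type and inode number. The sketch below assumes the "type:[inum]"
naming shown in the listing above; parse_ns_name and group_by_type are
hypothetical helper names, not part of the patchset.

```python
import re
from collections import defaultdict

# Entry names follow the "type:[inum]" form shown in the listing.
_NS_NAME = re.compile(r"^([a-z_]+):\[(\d+)\]$")

def parse_ns_name(name):
    """Split 'mnt:[4026531840]' into ('mnt', 4026531840)."""
    match = _NS_NAME.match(name)
    if match is None:
        raise ValueError("not a namespace entry name: %r" % (name,))
    return match.group(1), int(match.group(2))

def group_by_type(names):
    """Map each namespace type to the sorted list of its inode numbers."""
    groups = defaultdict(list)
    for name in names:
        ns_type, inum = parse_ns_name(name)
        groups[ns_type].append(inum)
    return {ns_type: sorted(inums) for ns_type, inums in groups.items()}
```

With the proposed directory in place, group_by_type(os.listdir("/proc/namespaces"))
would give a per-type view from a single directory read.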

A namespace ns is exposed if its user_ns is permitted from /proc's pid_ns.
I.e., each /proc mount is tied to a pid_ns, so /proc/namespaces shows only
those ns for which the following holds:

  in_userns(pid_ns->user_ns, ns->user_ns).

If ns is itself a user_ns, the condition is:

  in_userns(pid_ns->user_ns, ns).
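
The visibility rule can be modelled in userspace as an ancestry check. The
sketch below is only a model, not the kernel code: the UserNS class is
hypothetical, and in_userns here mirrors what the kernel helper of the same
name is understood to do, namely return true when the second argument is
the first one or one of its descendants.

```python
class UserNS:
    """Toy model of a user namespace: a parent pointer and a nesting level."""
    def __init__(self, parent=None):
        self.parent = parent
        self.level = 0 if parent is None else parent.level + 1

def in_userns(ancestor, child):
    """True if child is ancestor itself or one of its descendants."""
    # Walk up from child until it is no deeper than ancestor.
    while child.level > ancestor.level:
        child = child.parent
    return child is ancestor

def ns_visible(proc_user_ns, owner_user_ns):
    """Model of the rule: show a ns if its owning user_ns is in /proc's user_ns."""
    return in_userns(proc_user_ns, owner_user_ns)
```

Under this model a /proc mounted in the initial pid_ns sees everything,
while a /proc inside a container sees only namespaces owned by the
container's user_ns or by user namespaces nested below it.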

The patchset proceeds in these steps:

1) A generic counter in ns_common is introduced to replace the separate
   counters of each ns type (net::count, uts_namespace::kref,
   user_namespace::count, etc.). Patches [1-8];
2) Patch [9] introduces an IDR to link and iterate over live namespaces;
3) Patch [10] is a refactoring;
4) Patch [11] adds the /proc/namespaces directory and its fs methods;
5) Patches [12-23] make every namespace type use the added methods
   and appear in the /proc/namespaces directory.

This may be useful for writing efficient debug utilities (say, quickly
building a network topology) and checkpoint/restore software.
---

Kirill Tkhai (23):
  ns: Add common refcount into ns_common add use it as counter for net_ns
  uts: Use generic ns_common::count
  ipc: Use generic ns_common::count
  pid: Use generic ns_common::count
  user: Use generic ns_common::count
  mnt: Use generic ns_common::count
  cgroup: Use generic ns_common::count
  time: Use generic ns_common::count
  ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system
  fs: Rename fs/proc/namespaces.c into fs/proc/task_namespaces.c
  fs: Add /proc/namespaces/ directory
  user: Free user_ns one RCU grace period after final counter put
  user: Add user namespaces into ns_idr
  net: Add net namespaces into ns_idr
  pid: Eextract child_reaper check from pidns_for_children_get()
  proc_ns_operations: Add can_get method
  pid: Add pid namespaces into ns_idr
  uts: Free uts namespace one RCU grace period after final counter put
  uts: Add uts namespaces into ns_idr
  ipc: Add ipc namespaces into ns_idr
  mnt: Add mount namespaces into ns_idr
  cgroup: Add cgroup namespaces into ns_idr
  time: Add time namespaces into ns_idr


 fs/mount.h                     |    4
 fs/namespace.c                 |   14 +
 fs/nsfs.c                      |   78
 fs/proc/Makefile               |    1
 fs/proc/internal.h             |   18 +-
 fs/proc/namespaces.c           |  382 +++-
 fs/proc/root.c                 |   17 ++
 fs/proc/task_namespaces.c      |  183 +++
 include/linux/cgroup.h         |    6 -
 include/linux/ipc_namespace.h  |    3
 include/linux/ns_common.h      |   11 +
 include/linux/pid_namespace.h  |    4
 include/linux/proc_fs.h        |    1
 include/linux/proc_ns.h        |   12 +
 include/linux/time_namespace.h |   10 +
 include/linux/user_namespace.h |   10 +
 include/linux/utsname.h        |   10 +
 include/net/net_namespace.h    |   11 -
 init/version.c                 |    2
 ipc/msgutil.c                  |    2
 ipc/namespace.c                |   17 +-
 ipc/shm.c                      |    1
 kernel/cgroup/cgroup.c         |    2
 kernel/cgroup/namespace.c      |   25 ++-
 kernel/pid.c                   |    2
 kernel/pid_namespace.c         |   46 +++--
 kernel/time/namespace.c        |   20 +-