Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-27 Thread Christian Brauner
On Thu, Apr 26, 2018 at 07:35:47PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> 
> >> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner  writes:
> >> >> 
> >> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> >> >> Christian Brauner  writes:
> >> >> >> 
> >> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman 
> >> >> >> >  wrote:
> >> >> >> >
> >> >> >> >  Bah. This code is obviously correct and probably wrong.
> >> >> >> >
> >> >> >> >  How do we deliver uevents for network devices that are outside of 
> >> >> >> > the
> >> >> >> >  initial user namespace? The kernel still needs to deliver those.
> >> >> >> >
> >> >> >> >  The logic to figure out which network namespace a device needs to 
> >> >> >> > be
> >> >> >> >  delivered to is is present in kobj_bcast_filter. That logic will 
> >> >> >> > almost
> >> >> >> >  certainly need to be turned inside out. Sign not as easy as I 
> >> >> >> > would
> >> >> >> >  have hoped.
> >> >> >> >
> >> >> >> > My first patch that we discussed put additional filtering logic 
> >> >> >> > into kobj_bcast_filter for that very reason. But I can move that 
> >> >> >> > logic
> >> >> >> > out and come up with a new patch.
> >> >> >> 
> >> >> >> I may have mis-understood.
> >> >> >> 
> >> >> >> I heard and am still hearing additional filtering to reduce the 
> >> >> >> places
> >> >> >> the packet is delievered.
> >> >> >> 
> >> >> >> I am saying something needs to change to increase the number of 
> >> >> >> places
> >> >> >> the packet is delivered.
> >> >> >> 
> >> >> >> For the special class of devices that kobj_bcast_filter would apply 
> >> >> >> to
> >> >> >> those need to be delivered to netowrk namespaces  that are no longer 
> >> >> >> on
> >> >> >> uevent_sock_list.
> >> >> >> 
> >> >> >> So the code fundamentally needs to split into two paths.  Ordinary
> >> >> >> devices that use uevent_sock_list.  Network devices that are just
> >> >> >> delivered in their own network namespace.
> >> >> >> 
> >> >> >> netlink_broadcast_filtered gets to go away completely.
> >> >> >
> >> >> > The split *might* make sense but I think you're wrong about removing 
> >> >> > the
> >> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> >> >> > socket in uevent_sock_list itself it rather operates on the sockets in
> >> >> > mc_list. And if socket in mc_list can have a different network 
> >> >> > namespace
> >> >> > then the uevent_socket itself then your way won't work. That's why my
> >> >> > original patch added additional filtering in there. The way I see it 
> >> >> > we
> >> >> > need something like:
> >> >> 
> >> >> We already filter the sockets in the mc_list by network namespace.
> >> >
> >> > Oh really? That's good to know. I haven't found where in the code this
> >> > actually happens. I thought that when netlink_bind() is called anyone
> >> > could register themselves in mc_list.
> >> 
> >> The code in af_netlink.c does:
> >> > static void do_one_broadcast(struct sock *sk,
> >> >  struct netlink_broadcast_data *p)
> >> > {
> >> >  struct netlink_sock *nlk = nlk_sk(sk);
> >> >  int val;
> >> > 
> >> >  if (p->exclude_sk == sk)
> >> >  return;
> >> > 
> >> >  if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
> >> >  !test_bit(p->group - 1, nlk->groups))
> >> >  return;
> >> > 
> >> >  if (!net_eq(sock_net(sk), p->net)) {
> >>  Here
> >> >  if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >  return;
> >> ^^^ Here
> >> > 
> >> >  if (!peernet_has_id(sock_net(sk), p->net))
> >> >  return;
> >> > 
> >> >  if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >   CAP_NET_BROADCAST))
> >> >  return;
> >> >  }
> >> 
> >> Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
> >> you out if you are the wrong network namespace.
> >> 
> >> 
> >> >> When a packet is transmitted with netlink_broadcast it is only
> >> >> transmitted within a single network namespace.
> >> >> 
> >> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> >> >> with it's source network namespace so no confusion will result, and the
> >> >> permission checks have been done to make it safe. So you can safely
> >> >> ignore that case.  Please ignore that case.  It only needs to be
> >> >> considered if refactoring af_netlink.c
> >> >> 
> >> >> When I added netlink_broadcast_filtered I imagined that we would need
> >> >> code that worked across network namespaces that worked for 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-27 Thread Christian Brauner
On Thu, Apr 26, 2018 at 07:35:47PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> 
> >> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner  writes:
> >> >> 
> >> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> >> >> Christian Brauner  writes:
> >> >> >> 
> >> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman 
> >> >> >> >  wrote:
> >> >> >> >
> >> >> >> >  Bah. This code is obviously correct and probably wrong.
> >> >> >> >
> >> >> >> >  How do we deliver uevents for network devices that are outside of 
> >> >> >> > the
> >> >> >> >  initial user namespace? The kernel still needs to deliver those.
> >> >> >> >
> >> >> >> >  The logic to figure out which network namespace a device needs to 
> >> >> >> > be
> >> >> >> >  delivered to is is present in kobj_bcast_filter. That logic will 
> >> >> >> > almost
> >> >> >> >  certainly need to be turned inside out. Sign not as easy as I 
> >> >> >> > would
> >> >> >> >  have hoped.
> >> >> >> >
> >> >> >> > My first patch that we discussed put additional filtering logic 
> >> >> >> > into kobj_bcast_filter for that very reason. But I can move that 
> >> >> >> > logic
> >> >> >> > out and come up with a new patch.
> >> >> >> 
> >> >> >> I may have mis-understood.
> >> >> >> 
> >> >> >> I heard and am still hearing additional filtering to reduce the 
> >> >> >> places
> >> >> >> the packet is delievered.
> >> >> >> 
> >> >> >> I am saying something needs to change to increase the number of 
> >> >> >> places
> >> >> >> the packet is delivered.
> >> >> >> 
> >> >> >> For the special class of devices that kobj_bcast_filter would apply 
> >> >> >> to
> >> >> >> those need to be delivered to netowrk namespaces  that are no longer 
> >> >> >> on
> >> >> >> uevent_sock_list.
> >> >> >> 
> >> >> >> So the code fundamentally needs to split into two paths.  Ordinary
> >> >> >> devices that use uevent_sock_list.  Network devices that are just
> >> >> >> delivered in their own network namespace.
> >> >> >> 
> >> >> >> netlink_broadcast_filtered gets to go away completely.
> >> >> >
> >> >> > The split *might* make sense but I think you're wrong about removing 
> >> >> > the
> >> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> >> >> > socket in uevent_sock_list itself it rather operates on the sockets in
> >> >> > mc_list. And if socket in mc_list can have a different network 
> >> >> > namespace
> >> >> > then the uevent_socket itself then your way won't work. That's why my
> >> >> > original patch added additional filtering in there. The way I see it 
> >> >> > we
> >> >> > need something like:
> >> >> 
> >> >> We already filter the sockets in the mc_list by network namespace.
> >> >
> >> > Oh really? That's good to know. I haven't found where in the code this
> >> > actually happens. I thought that when netlink_bind() is called anyone
> >> > could register themselves in mc_list.
> >> 
> >> The code in af_netlink.c does:
> >> > static void do_one_broadcast(struct sock *sk,
> >> >  struct netlink_broadcast_data *p)
> >> > {
> >> >  struct netlink_sock *nlk = nlk_sk(sk);
> >> >  int val;
> >> > 
> >> >  if (p->exclude_sk == sk)
> >> >  return;
> >> > 
> >> >  if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
> >> >  !test_bit(p->group - 1, nlk->groups))
> >> >  return;
> >> > 
> >> >  if (!net_eq(sock_net(sk), p->net)) {
> >>  Here
> >> >  if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >  return;
> >> ^^^ Here
> >> > 
> >> >  if (!peernet_has_id(sock_net(sk), p->net))
> >> >  return;
> >> > 
> >> >  if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >   CAP_NET_BROADCAST))
> >> >  return;
> >> >  }
> >> 
> >> Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
> >> you out if you are the wrong network namespace.
> >> 
> >> 
> >> >> When a packet is transmitted with netlink_broadcast it is only
> >> >> transmitted within a single network namespace.
> >> >> 
> >> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> >> >> with it's source network namespace so no confusion will result, and the
> >> >> permission checks have been done to make it safe. So you can safely
> >> >> ignore that case.  Please ignore that case.  It only needs to be
> >> >> considered if refactoring af_netlink.c
> >> >> 
> >> >> When I added netlink_broadcast_filtered I imagined that we would need
> >> >> code that worked across network namespaces that worked for different
> >> >> namespaces.   So it looked like we would need the level of granularity
> >> >> that you can get with netlink_broadcast_filtered.  It turns 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Eric W. Biederman
Christian Brauner  writes:

> On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner  writes:
>> >> 
>> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> >> >> Christian Brauner  writes:
>> >> >> 
>> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman 
>> >> >> >  wrote:
>> >> >> >
>> >> >> >  Bah. This code is obviously correct and probably wrong.
>> >> >> >
>> >> >> >  How do we deliver uevents for network devices that are outside of 
>> >> >> > the
>> >> >> >  initial user namespace? The kernel still needs to deliver those.
>> >> >> >
>> >> >> >  The logic to figure out which network namespace a device needs to be
>> >> >> >  delivered to is is present in kobj_bcast_filter. That logic will 
>> >> >> > almost
>> >> >> >  certainly need to be turned inside out. Sign not as easy as I would
>> >> >> >  have hoped.
>> >> >> >
>> >> >> > My first patch that we discussed put additional filtering logic into 
>> >> >> > kobj_bcast_filter for that very reason. But I can move that logic
>> >> >> > out and come up with a new patch.
>> >> >> 
>> >> >> I may have mis-understood.
>> >> >> 
>> >> >> I heard and am still hearing additional filtering to reduce the places
>> >> >> the packet is delievered.
>> >> >> 
>> >> >> I am saying something needs to change to increase the number of places
>> >> >> the packet is delivered.
>> >> >> 
>> >> >> For the special class of devices that kobj_bcast_filter would apply to
>> >> >> those need to be delivered to netowrk namespaces  that are no longer on
>> >> >> uevent_sock_list.
>> >> >> 
>> >> >> So the code fundamentally needs to split into two paths.  Ordinary
>> >> >> devices that use uevent_sock_list.  Network devices that are just
>> >> >> delivered in their own network namespace.
>> >> >> 
>> >> >> netlink_broadcast_filtered gets to go away completely.
>> >> >
>> >> > The split *might* make sense but I think you're wrong about removing the
>> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
>> >> > socket in uevent_sock_list itself it rather operates on the sockets in
>> >> > mc_list. And if socket in mc_list can have a different network namespace
>> >> > then the uevent_socket itself then your way won't work. That's why my
>> >> > original patch added additional filtering in there. The way I see it we
>> >> > need something like:
>> >> 
>> >> We already filter the sockets in the mc_list by network namespace.
>> >
>> > Oh really? That's good to know. I haven't found where in the code this
>> > actually happens. I thought that when netlink_bind() is called anyone
>> > could register themselves in mc_list.
>> 
>> The code in af_netlink.c does:
>> > static void do_one_broadcast(struct sock *sk,
>> >struct netlink_broadcast_data *p)
>> > {
>> >struct netlink_sock *nlk = nlk_sk(sk);
>> >int val;
>> > 
>> >if (p->exclude_sk == sk)
>> >return;
>> > 
>> >if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
>> >!test_bit(p->group - 1, nlk->groups))
>> >return;
>> > 
>> >if (!net_eq(sock_net(sk), p->net)) {
>>  Here
>> >if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >return;
>> ^^^ Here
>> > 
>> >if (!peernet_has_id(sock_net(sk), p->net))
>> >return;
>> > 
>> >if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> > CAP_NET_BROADCAST))
>> >return;
>> >}
>> 
>> Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
>> you out if you are the wrong network namespace.
>> 
>> 
>> >> When a packet is transmitted with netlink_broadcast it is only
>> >> transmitted within a single network namespace.
>> >> 
>> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
>> >> with it's source network namespace so no confusion will result, and the
>> >> permission checks have been done to make it safe. So you can safely
>> >> ignore that case.  Please ignore that case.  It only needs to be
>> >> considered if refactoring af_netlink.c
>> >> 
>> >> When I added netlink_broadcast_filtered I imagined that we would need
>> >> code that worked across network namespaces that worked for different
>> >> namespaces.   So it looked like we would need the level of granularity
>> >> that you can get with netlink_broadcast_filtered.  It turns out we don't
>> >> and that it was a case of over design.  As the only split we care about
>> >> is per network namespace there is no need for
>> >> netlink_broadcast_filtered.
>> >> 
>> >> > 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Eric W. Biederman
Christian Brauner  writes:

> On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner  writes:
>> >> 
>> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> >> >> Christian Brauner  writes:
>> >> >> 
>> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman 
>> >> >> >  wrote:
>> >> >> >
>> >> >> >  Bah. This code is obviously correct and probably wrong.
>> >> >> >
>> >> >> >  How do we deliver uevents for network devices that are outside of 
>> >> >> > the
>> >> >> >  initial user namespace? The kernel still needs to deliver those.
>> >> >> >
>> >> >> >  The logic to figure out which network namespace a device needs to be
>> >> >> >  delivered to is is present in kobj_bcast_filter. That logic will 
>> >> >> > almost
>> >> >> >  certainly need to be turned inside out. Sign not as easy as I would
>> >> >> >  have hoped.
>> >> >> >
>> >> >> > My first patch that we discussed put additional filtering logic into 
>> >> >> > kobj_bcast_filter for that very reason. But I can move that logic
>> >> >> > out and come up with a new patch.
>> >> >> 
>> >> >> I may have mis-understood.
>> >> >> 
>> >> >> I heard and am still hearing additional filtering to reduce the places
>> >> >> the packet is delievered.
>> >> >> 
>> >> >> I am saying something needs to change to increase the number of places
>> >> >> the packet is delivered.
>> >> >> 
>> >> >> For the special class of devices that kobj_bcast_filter would apply to
>> >> >> those need to be delivered to netowrk namespaces  that are no longer on
>> >> >> uevent_sock_list.
>> >> >> 
>> >> >> So the code fundamentally needs to split into two paths.  Ordinary
>> >> >> devices that use uevent_sock_list.  Network devices that are just
>> >> >> delivered in their own network namespace.
>> >> >> 
>> >> >> netlink_broadcast_filtered gets to go away completely.
>> >> >
>> >> > The split *might* make sense but I think you're wrong about removing the
>> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
>> >> > socket in uevent_sock_list itself it rather operates on the sockets in
>> >> > mc_list. And if socket in mc_list can have a different network namespace
>> >> > then the uevent_socket itself then your way won't work. That's why my
>> >> > original patch added additional filtering in there. The way I see it we
>> >> > need something like:
>> >> 
>> >> We already filter the sockets in the mc_list by network namespace.
>> >
>> > Oh really? That's good to know. I haven't found where in the code this
>> > actually happens. I thought that when netlink_bind() is called anyone
>> > could register themselves in mc_list.
>> 
>> The code in af_netlink.c does:
>> > static void do_one_broadcast(struct sock *sk,
>> >struct netlink_broadcast_data *p)
>> > {
>> >struct netlink_sock *nlk = nlk_sk(sk);
>> >int val;
>> > 
>> >if (p->exclude_sk == sk)
>> >return;
>> > 
>> >if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
>> >!test_bit(p->group - 1, nlk->groups))
>> >return;
>> > 
>> >if (!net_eq(sock_net(sk), p->net)) {
>>  Here
>> >if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >return;
>> ^^^ Here
>> > 
>> >if (!peernet_has_id(sock_net(sk), p->net))
>> >return;
>> > 
>> >if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> > CAP_NET_BROADCAST))
>> >return;
>> >}
>> 
>> Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
>> you out if you are the wrong network namespace.
>> 
>> 
>> >> When a packet is transmitted with netlink_broadcast it is only
>> >> transmitted within a single network namespace.
>> >> 
>> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
>> >> with it's source network namespace so no confusion will result, and the
>> >> permission checks have been done to make it safe. So you can safely
>> >> ignore that case.  Please ignore that case.  It only needs to be
>> >> considered if refactoring af_netlink.c
>> >> 
>> >> When I added netlink_broadcast_filtered I imagined that we would need
>> >> code that worked across network namespaces that worked for different
>> >> namespaces.   So it looked like we would need the level of granularity
>> >> that you can get with netlink_broadcast_filtered.  It turns out we don't
>> >> and that it was a case of over design.  As the only split we care about
>> >> is per network namespace there is no need for
>> >> netlink_broadcast_filtered.
>> >> 
>> >> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
>> >> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
>> >> >
>> >> > The question that 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Christian Brauner
On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> 
> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner  writes:
> >> >> 
> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  
> >> >> > wrote:
> >> >> >
> >> >> >  Bah. This code is obviously correct and probably wrong.
> >> >> >
> >> >> >  How do we deliver uevents for network devices that are outside of the
> >> >> >  initial user namespace? The kernel still needs to deliver those.
> >> >> >
> >> >> >  The logic to figure out which network namespace a device needs to be
> >> >> >  delivered to is is present in kobj_bcast_filter. That logic will 
> >> >> > almost
> >> >> >  certainly need to be turned inside out. Sign not as easy as I would
> >> >> >  have hoped.
> >> >> >
> >> >> > My first patch that we discussed put additional filtering logic into 
> >> >> > kobj_bcast_filter for that very reason. But I can move that logic
> >> >> > out and come up with a new patch.
> >> >> 
> >> >> I may have mis-understood.
> >> >> 
> >> >> I heard and am still hearing additional filtering to reduce the places
> >> >> the packet is delievered.
> >> >> 
> >> >> I am saying something needs to change to increase the number of places
> >> >> the packet is delivered.
> >> >> 
> >> >> For the special class of devices that kobj_bcast_filter would apply to
> >> >> those need to be delivered to netowrk namespaces  that are no longer on
> >> >> uevent_sock_list.
> >> >> 
> >> >> So the code fundamentally needs to split into two paths.  Ordinary
> >> >> devices that use uevent_sock_list.  Network devices that are just
> >> >> delivered in their own network namespace.
> >> >> 
> >> >> netlink_broadcast_filtered gets to go away completely.
> >> >
> >> > The split *might* make sense but I think you're wrong about removing the
> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> >> > socket in uevent_sock_list itself it rather operates on the sockets in
> >> > mc_list. And if socket in mc_list can have a different network namespace
> >> > then the uevent_socket itself then your way won't work. That's why my
> >> > original patch added additional filtering in there. The way I see it we
> >> > need something like:
> >> 
> >> We already filter the sockets in the mc_list by network namespace.
> >
> > Oh really? That's good to know. I haven't found where in the code this
> > actually happens. I thought that when netlink_bind() is called anyone
> > could register themselves in mc_list.
> 
> The code in af_netlink.c does:
> > static void do_one_broadcast(struct sock *sk,
> > struct netlink_broadcast_data *p)
> > {
> > struct netlink_sock *nlk = nlk_sk(sk);
> > int val;
> > 
> > if (p->exclude_sk == sk)
> > return;
> > 
> > if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
> > !test_bit(p->group - 1, nlk->groups))
> > return;
> > 
> > if (!net_eq(sock_net(sk), p->net)) {
>  Here
> > if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> > return;
> ^^^ Here
> > 
> > if (!peernet_has_id(sock_net(sk), p->net))
> > return;
> > 
> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >  CAP_NET_BROADCAST))
> > return;
> > }
> 
> Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
> you out if you are the wrong network namespace.
> 
> 
> >> When a packet is transmitted with netlink_broadcast it is only
> >> transmitted within a single network namespace.
> >> 
> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> >> with it's source network namespace so no confusion will result, and the
> >> permission checks have been done to make it safe. So you can safely
> >> ignore that case.  Please ignore that case.  It only needs to be
> >> considered if refactoring af_netlink.c
> >> 
> >> When I added netlink_broadcast_filtered I imagined that we would need
> >> code that worked across network namespaces that worked for different
> >> namespaces.   So it looked like we would need the level of granularity
> >> that you can get with netlink_broadcast_filtered.  It turns out we don't
> >> and that it was a case of over design.  As the only split we care about
> >> is per network namespace there is no need for
> >> netlink_broadcast_filtered.
> >> 
> >> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> >> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
> >> >
> >> > The question that remains is 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Christian Brauner
On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> 
> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner  writes:
> >> >> 
> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  
> >> >> > wrote:
> >> >> >
> >> >> >  Bah. This code is obviously correct and probably wrong.
> >> >> >
> >> >> >  How do we deliver uevents for network devices that are outside of the
> >> >> >  initial user namespace? The kernel still needs to deliver those.
> >> >> >
> >> >> >  The logic to figure out which network namespace a device needs to be
> >> >> >  delivered to is is present in kobj_bcast_filter. That logic will 
> >> >> > almost
> >> >> >  certainly need to be turned inside out. Sign not as easy as I would
> >> >> >  have hoped.
> >> >> >
> >> >> > My first patch that we discussed put additional filtering logic into 
> >> >> > kobj_bcast_filter for that very reason. But I can move that logic
> >> >> > out and come up with a new patch.
> >> >> 
> >> >> I may have mis-understood.
> >> >> 
> >> >> I heard and am still hearing additional filtering to reduce the places
> >> >> the packet is delievered.
> >> >> 
> >> >> I am saying something needs to change to increase the number of places
> >> >> the packet is delivered.
> >> >> 
> >> >> For the special class of devices that kobj_bcast_filter would apply to
> >> >> those need to be delivered to netowrk namespaces  that are no longer on
> >> >> uevent_sock_list.
> >> >> 
> >> >> So the code fundamentally needs to split into two paths.  Ordinary
> >> >> devices that use uevent_sock_list.  Network devices that are just
> >> >> delivered in their own network namespace.
> >> >> 
> >> >> netlink_broadcast_filtered gets to go away completely.
> >> >
> >> > The split *might* make sense but I think you're wrong about removing the
> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> >> > socket in uevent_sock_list itself it rather operates on the sockets in
> >> > mc_list. And if socket in mc_list can have a different network namespace
> >> > then the uevent_socket itself then your way won't work. That's why my
> >> > original patch added additional filtering in there. The way I see it we
> >> > need something like:
> >> 
> >> We already filter the sockets in the mc_list by network namespace.
> >
> > Oh really? That's good to know. I haven't found where in the code this
> > actually happens. I thought that when netlink_bind() is called anyone
> > could register themselves in mc_list.
> 
> The code in af_netlink.c does:
> > static void do_one_broadcast(struct sock *sk,
> > struct netlink_broadcast_data *p)
> > {
> > struct netlink_sock *nlk = nlk_sk(sk);
> > int val;
> > 
> > if (p->exclude_sk == sk)
> > return;
> > 
> > if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
> > !test_bit(p->group - 1, nlk->groups))
> > return;
> > 
> > if (!net_eq(sock_net(sk), p->net)) {
>  Here
> > if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> > return;
> ^^^ Here
> > 
> > if (!peernet_has_id(sock_net(sk), p->net))
> > return;
> > 
> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >  CAP_NET_BROADCAST))
> > return;
> > }
> 
> Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
> you out if you are the wrong network namespace.
> 
> 
> >> When a packet is transmitted with netlink_broadcast it is only
> >> transmitted within a single network namespace.
> >> 
> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> >> with it's source network namespace so no confusion will result, and the
> >> permission checks have been done to make it safe. So you can safely
> >> ignore that case.  Please ignore that case.  It only needs to be
> >> considered if refactoring af_netlink.c
> >> 
> >> When I added netlink_broadcast_filtered I imagined that we would need
> >> code that worked across network namespaces that worked for different
> >> namespaces.   So it looked like we would need the level of granularity
> >> that you can get with netlink_broadcast_filtered.  It turns out we don't
> >> and that it was a case of over design.  As the only split we care about
> >> is per network namespace there is no need for
> >> netlink_broadcast_filtered.
> >> 
> >> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> >> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
> >> >
> >> > The question that remains is whether we can rely on the network
> >> > namespace information we can gather from the kobject_ns_type_operations
> >> > to 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Eric W. Biederman
Christian Brauner  writes:

> On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner  writes:
>> >> 
>> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  
>> >> > wrote:
>> >> >
>> >> >  Bah. This code is obviously correct and probably wrong.
>> >> >
>> >> >  How do we deliver uevents for network devices that are outside of the
>> >> >  initial user namespace? The kernel still needs to deliver those.
>> >> >
>> >> >  The logic to figure out which network namespace a device needs to be
>> >> >  delivered to is is present in kobj_bcast_filter. That logic will almost
>> >> >  certainly need to be turned inside out. Sign not as easy as I would
>> >> >  have hoped.
>> >> >
>> >> > My first patch that we discussed put additional filtering logic into 
>> >> > kobj_bcast_filter for that very reason. But I can move that logic
>> >> > out and come up with a new patch.
>> >> 
>> >> I may have mis-understood.
>> >> 
>> >> I heard and am still hearing additional filtering to reduce the places
>> >> the packet is delievered.
>> >> 
>> >> I am saying something needs to change to increase the number of places
>> >> the packet is delivered.
>> >> 
>> >> For the special class of devices that kobj_bcast_filter would apply to
>> >> those need to be delivered to netowrk namespaces  that are no longer on
>> >> uevent_sock_list.
>> >> 
>> >> So the code fundamentally needs to split into two paths.  Ordinary
>> >> devices that use uevent_sock_list.  Network devices that are just
>> >> delivered in their own network namespace.
>> >> 
>> >> netlink_broadcast_filtered gets to go away completely.
>> >
>> > The split *might* make sense but I think you're wrong about removing the
>> > kobj_bcast_filter. The current filter doesn't operate on the uevent
>> > socket in uevent_sock_list itself it rather operates on the sockets in
>> > mc_list. And if socket in mc_list can have a different network namespace
>> > then the uevent_socket itself then your way won't work. That's why my
>> > original patch added additional filtering in there. The way I see it we
>> > need something like:
>> 
>> We already filter the sockets in the mc_list by network namespace.
>
> Oh really? That's good to know. I haven't found where in the code this
> actually happens. I thought that when netlink_bind() is called anyone
> could register themselves in mc_list.

The code in af_netlink.c does:
> static void do_one_broadcast(struct sock *sk,
>   struct netlink_broadcast_data *p)
> {
>   struct netlink_sock *nlk = nlk_sk(sk);
>   int val;
> 
>   if (p->exclude_sk == sk)
>   return;
> 
>   if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
>   !test_bit(p->group - 1, nlk->groups))
>   return;
> 
>   if (!net_eq(sock_net(sk), p->net)) {
 Here
>   if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>   return;
^^^ Here
> 
>   if (!peernet_has_id(sock_net(sk), p->net))
>   return;
> 
>   if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>CAP_NET_BROADCAST))
>   return;
>   }

Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
you out if you are the wrong network namespace.


>> When a packet is transmitted with netlink_broadcast it is only
>> transmitted within a single network namespace.
>> 
>> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
>> with it's source network namespace so no confusion will result, and the
>> permission checks have been done to make it safe. So you can safely
>> ignore that case.  Please ignore that case.  It only needs to be
>> considered if refactoring af_netlink.c
>> 
>> When I added netlink_broadcast_filtered I imagined that we would need
>> code that worked across network namespaces that worked for different
>> namespaces.   So it looked like we would need the level of granularity
>> that you can get with netlink_broadcast_filtered.  It turns out we don't
>> and that it was a case of over design.  As the only split we care about
>> is per network namespace there is no need for
>> netlink_broadcast_filtered.
>> 
>> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
>> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
>> >
>> > The question that remains is whether we can rely on the network
>> > namespace information we can gather from the kobject_ns_type_operations
>> > to decide where we want to broadcast that event to. So something
>> > *like*:
>> 
>> We can.  We already do.  That is what kobj_bcast_filter implements.

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Eric W. Biederman
Christian Brauner  writes:

> On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner  writes:
>> >> 
>> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  
>> >> > wrote:
>> >> >
>> >> >  Bah. This code is obviously correct and probably wrong.
>> >> >
>> >> >  How do we deliver uevents for network devices that are outside of the
>> >> >  initial user namespace? The kernel still needs to deliver those.
>> >> >
>> >> >  The logic to figure out which network namespace a device needs to be
>> >> >  delivered to is is present in kobj_bcast_filter. That logic will almost
>> >> >  certainly need to be turned inside out. Sign not as easy as I would
>> >> >  have hoped.
>> >> >
>> >> > My first patch that we discussed put additional filtering logic into 
>> >> > kobj_bcast_filter for that very reason. But I can move that logic
>> >> > out and come up with a new patch.
>> >> 
>> >> I may have mis-understood.
>> >> 
>> >> I heard and am still hearing additional filtering to reduce the places
>> >> the packet is delievered.
>> >> 
>> >> I am saying something needs to change to increase the number of places
>> >> the packet is delivered.
>> >> 
>> >> For the special class of devices that kobj_bcast_filter would apply to
>> >> those need to be delivered to netowrk namespaces  that are no longer on
>> >> uevent_sock_list.
>> >> 
>> >> So the code fundamentally needs to split into two paths.  Ordinary
>> >> devices that use uevent_sock_list.  Network devices that are just
>> >> delivered in their own network namespace.
>> >> 
>> >> netlink_broadcast_filtered gets to go away completely.
>> >
>> > The split *might* make sense but I think you're wrong about removing the
>> > kobj_bcast_filter. The current filter doesn't operate on the uevent
>> > socket in uevent_sock_list itself it rather operates on the sockets in
>> > mc_list. And if socket in mc_list can have a different network namespace
>> > then the uevent_socket itself then your way won't work. That's why my
>> > original patch added additional filtering in there. The way I see it we
>> > need something like:
>> 
>> We already filter the sockets in the mc_list by network namespace.
>
> Oh really? That's good to know. I haven't found where in the code this
> actually happens. I thought that when netlink_bind() is called anyone
> could register themselves in mc_list.

The code in af_netlink.c does:
> static void do_one_broadcast(struct sock *sk,
>   struct netlink_broadcast_data *p)
> {
>   struct netlink_sock *nlk = nlk_sk(sk);
>   int val;
> 
>   if (p->exclude_sk == sk)
>   return;
> 
>   if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
>   !test_bit(p->group - 1, nlk->groups))
>   return;
> 
>   if (!net_eq(sock_net(sk), p->net)) {
 Here
>   if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>   return;
^^^ Here
> 
>   if (!peernet_has_id(sock_net(sk), p->net))
>   return;
> 
>   if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>CAP_NET_BROADCAST))
>   return;
>   }

Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
you out if you are the wrong network namespace.


>> When a packet is transmitted with netlink_broadcast it is only
>> transmitted within a single network namespace.
>> 
>> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
>> with it's source network namespace so no confusion will result, and the
>> permission checks have been done to make it safe. So you can safely
>> ignore that case.  Please ignore that case.  It only needs to be
>> considered if refactoring af_netlink.c
>> 
>> When I added netlink_broadcast_filtered I imagined that we would need
>> code that worked across network namespaces that worked for different
>> namespaces.   So it looked like we would need the level of granularity
>> that you can get with netlink_broadcast_filtered.  It turns out we don't
>> and that it was a case of over design.  As the only split we care about
>> is per network namespace there is no need for
>> netlink_broadcast_filtered.
>> 
>> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
>> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
>> >
>> > The question that remains is whether we can rely on the network
>> > namespace information we can gather from the kobject_ns_type_operations
>> > to decide where we want to broadcast that event to. So something
>> > *like*:
>> 
>> We can.  We already do.  That is what kobj_bcast_filter implements.
>> 
>> >ops = kobj_ns_ops(kobj);
>> >if (!ops && kobj->kset) {
>> >struct kobject *ksobj = 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Christian Brauner
On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> 
> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  
> >> > wrote:
> >> >
> >> >  Bah. This code is obviously correct and probably wrong.
> >> >
> >> >  How do we deliver uevents for network devices that are outside of the
> >> >  initial user namespace? The kernel still needs to deliver those.
> >> >
> >> >  The logic to figure out which network namespace a device needs to be
> >> >  delivered to is is present in kobj_bcast_filter. That logic will almost
> >> >  certainly need to be turned inside out. Sign not as easy as I would
> >> >  have hoped.
> >> >
> >> > My first patch that we discussed put additional filtering logic into 
> >> > kobj_bcast_filter for that very reason. But I can move that logic
> >> > out and come up with a new patch.
> >> 
> >> I may have mis-understood.
> >> 
> >> I heard and am still hearing additional filtering to reduce the places
> >> the packet is delievered.
> >> 
> >> I am saying something needs to change to increase the number of places
> >> the packet is delivered.
> >> 
> >> For the special class of devices that kobj_bcast_filter would apply to
> >> those need to be delivered to netowrk namespaces  that are no longer on
> >> uevent_sock_list.
> >> 
> >> So the code fundamentally needs to split into two paths.  Ordinary
> >> devices that use uevent_sock_list.  Network devices that are just
> >> delivered in their own network namespace.
> >> 
> >> netlink_broadcast_filtered gets to go away completely.
> >
> > The split *might* make sense but I think you're wrong about removing the
> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> > socket in uevent_sock_list itself it rather operates on the sockets in
> > mc_list. And if socket in mc_list can have a different network namespace
> > then the uevent_socket itself then your way won't work. That's why my
> > original patch added additional filtering in there. The way I see it we
> > need something like:
> 
> We already filter the sockets in the mc_list by network namespace.

Oh really? That's good to know. I haven't found where in the code this
actually happens. I thought that when netlink_bind() is called anyone
could register themselves in mc_list.

> 
> When a packet is transmitted with netlink_broadcast it is only
> transmitted within a single network namespace.
> 
> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> with it's source network namespace so no confusion will result, and the
> permission checks have been done to make it safe. So you can safely
> ignore that case.  Please ignore that case.  It only needs to be
> considered if refactoring af_netlink.c
> 
> When I added netlink_broadcast_filtered I imagined that we would need
> code that worked across network namespaces that worked for different
> namespaces.   So it looked like we would need the level of granularity
> that you can get with netlink_broadcast_filtered.  It turns out we don't
> and that it was a case of over design.  As the only split we care about
> is per network namespace there is no need for
> netlink_broadcast_filtered.
> 
> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
> >
> > The question that remains is whether we can rely on the network
> > namespace information we can gather from the kobject_ns_type_operations
> > to decide where we want to broadcast that event to. So something
> > *like*:
> 
> We can.  We already do.  That is what kobj_bcast_filter implements.
> 
> > ops = kobj_ns_ops(kobj);
> > if (!ops && kobj->kset) {
> > struct kobject *ksobj = >kset->kobj;
> > if (ksobj->parent != NULL)
> > ops = kobj_ns_ops(ksobj->parent);
> > }
> >
> > if (ops && ops->netlink_ns && kobj->ktype->namespace)
> > if (ops->type == KOBJ_NS_TYPE_NET)
> > net = kobj->ktype->namespace(kobj);
> 
> Please note the only entry in the enumeration in the kobj_ns_type
> enumeration other than KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET.  So the
> check for ops->type in this case is redundant.

Yes, I know the reason for doing it explicitly is to block the case
where kobjects get tagged with other namespaces. So we'd need to be
vigilant should that ever happen but fine.

> 
> That is something else that could be simplifed.  At the time it was the
> necessary to get the sysfs changes merged.
> 
> > if (!net || net->user_ns == _user_ns)
> > ret = init_user_ns_broadcast(env, action_string, devpath);
> > else
> > ret = user_ns_broadcast(net->uevent_sock->sk, env,
> > action_string, 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Christian Brauner
On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner  writes:
> >> 
> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  
> >> > wrote:
> >> >
> >> >  Bah. This code is obviously correct and probably wrong.
> >> >
> >> >  How do we deliver uevents for network devices that are outside of the
> >> >  initial user namespace? The kernel still needs to deliver those.
> >> >
> >> >  The logic to figure out which network namespace a device needs to be
> >> >  delivered to is is present in kobj_bcast_filter. That logic will almost
> >> >  certainly need to be turned inside out. Sign not as easy as I would
> >> >  have hoped.
> >> >
> >> > My first patch that we discussed put additional filtering logic into 
> >> > kobj_bcast_filter for that very reason. But I can move that logic
> >> > out and come up with a new patch.
> >> 
> >> I may have mis-understood.
> >> 
> >> I heard and am still hearing additional filtering to reduce the places
> >> the packet is delievered.
> >> 
> >> I am saying something needs to change to increase the number of places
> >> the packet is delivered.
> >> 
> >> For the special class of devices that kobj_bcast_filter would apply to
> >> those need to be delivered to netowrk namespaces  that are no longer on
> >> uevent_sock_list.
> >> 
> >> So the code fundamentally needs to split into two paths.  Ordinary
> >> devices that use uevent_sock_list.  Network devices that are just
> >> delivered in their own network namespace.
> >> 
> >> netlink_broadcast_filtered gets to go away completely.
> >
> > The split *might* make sense but I think you're wrong about removing the
> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> > socket in uevent_sock_list itself it rather operates on the sockets in
> > mc_list. And if socket in mc_list can have a different network namespace
> > then the uevent_socket itself then your way won't work. That's why my
> > original patch added additional filtering in there. The way I see it we
> > need something like:
> 
> We already filter the sockets in the mc_list by network namespace.

Oh really? That's good to know. I haven't found where in the code this
actually happens. I thought that when netlink_bind() is called anyone
could register themselves in mc_list.

> 
> When a packet is transmitted with netlink_broadcast it is only
> transmitted within a single network namespace.
> 
> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> with it's source network namespace so no confusion will result, and the
> permission checks have been done to make it safe. So you can safely
> ignore that case.  Please ignore that case.  It only needs to be
> considered if refactoring af_netlink.c
> 
> When I added netlink_broadcast_filtered I imagined that we would need
> code that worked across network namespaces that worked for different
> namespaces.   So it looked like we would need the level of granularity
> that you can get with netlink_broadcast_filtered.  It turns out we don't
> and that it was a case of over design.  As the only split we care about
> is per network namespace there is no need for
> netlink_broadcast_filtered.
> 
> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
> >
> > The question that remains is whether we can rely on the network
> > namespace information we can gather from the kobject_ns_type_operations
> > to decide where we want to broadcast that event to. So something
> > *like*:
> 
> We can.  We already do.  That is what kobj_bcast_filter implements.
> 
> > ops = kobj_ns_ops(kobj);
> > if (!ops && kobj->kset) {
> > struct kobject *ksobj = >kset->kobj;
> > if (ksobj->parent != NULL)
> > ops = kobj_ns_ops(ksobj->parent);
> > }
> >
> > if (ops && ops->netlink_ns && kobj->ktype->namespace)
> > if (ops->type == KOBJ_NS_TYPE_NET)
> > net = kobj->ktype->namespace(kobj);
> 
> Please note the only entry in the enumeration in the kobj_ns_type
> enumeration other than KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET.  So the
> check for ops->type in this case is redundant.

Yes, I know the reason for doing it explicitly is to block the case
where kobjects get tagged with other namespaces. So we'd need to be
vigilant should that ever happen but fine.

> 
> That is something else that could be simplifed.  At the time it was the
> necessary to get the sysfs changes merged.
> 
> > if (!net || net->user_ns == _user_ns)
> > ret = init_user_ns_broadcast(env, action_string, devpath);
> > else
> > ret = user_ns_broadcast(net->uevent_sock->sk, env,
> > action_string, devpath);
> 
> Almost.
> 
>   if (!net)
>   

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Eric W. Biederman
Christian Brauner  writes:

> On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  
>> > wrote:
>> >
>> >  Bah. This code is obviously correct and probably wrong.
>> >
>> >  How do we deliver uevents for network devices that are outside of the
>> >  initial user namespace? The kernel still needs to deliver those.
>> >
>> >  The logic to figure out which network namespace a device needs to be
>> >  delivered to is is present in kobj_bcast_filter. That logic will almost
>> >  certainly need to be turned inside out. Sign not as easy as I would
>> >  have hoped.
>> >
>> > My first patch that we discussed put additional filtering logic into 
>> > kobj_bcast_filter for that very reason. But I can move that logic
>> > out and come up with a new patch.
>> 
>> I may have mis-understood.
>> 
>> I heard and am still hearing additional filtering to reduce the places
>> the packet is delievered.
>> 
>> I am saying something needs to change to increase the number of places
>> the packet is delivered.
>> 
>> For the special class of devices that kobj_bcast_filter would apply to
>> those need to be delivered to netowrk namespaces  that are no longer on
>> uevent_sock_list.
>> 
>> So the code fundamentally needs to split into two paths.  Ordinary
>> devices that use uevent_sock_list.  Network devices that are just
>> delivered in their own network namespace.
>> 
>> netlink_broadcast_filtered gets to go away completely.
>
> The split *might* make sense but I think you're wrong about removing the
> kobj_bcast_filter. The current filter doesn't operate on the uevent
> socket in uevent_sock_list itself it rather operates on the sockets in
> mc_list. And if socket in mc_list can have a different network namespace
> then the uevent_socket itself then your way won't work. That's why my
> original patch added additional filtering in there. The way I see it we
> need something like:

We already filter the sockets in the mc_list by network namespace.

When a packet is transmitted with netlink_broadcast it is only
transmitted within a single network namespace.

Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
with it's source network namespace so no confusion will result, and the
permission checks have been done to make it safe. So you can safely
ignore that case.  Please ignore that case.  It only needs to be
considered if refactoring af_netlink.c

When I added netlink_broadcast_filtered I imagined that we would need
code that worked across network namespaces that worked for different
namespaces.   So it looked like we would need the level of granularity
that you can get with netlink_broadcast_filtered.  It turns out we don't
and that it was a case of over design.  As the only split we care about
is per network namespace there is no need for
netlink_broadcast_filtered.

> init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
>
> The question that remains is whether we can rely on the network
> namespace information we can gather from the kobject_ns_type_operations
> to decide where we want to broadcast that event to. So something
> *like*:

We can.  We already do.  That is what kobj_bcast_filter implements.

>   ops = kobj_ns_ops(kobj);
>   if (!ops && kobj->kset) {
>   struct kobject *ksobj = >kset->kobj;
>   if (ksobj->parent != NULL)
>   ops = kobj_ns_ops(ksobj->parent);
>   }
>
>   if (ops && ops->netlink_ns && kobj->ktype->namespace)
>   if (ops->type == KOBJ_NS_TYPE_NET)
>   net = kobj->ktype->namespace(kobj);

Please note the only entry in the enumeration in the kobj_ns_type
enumeration other than KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET.  So the
check for ops->type in this case is redundant.

That is something else that could be simplifed.  At the time it was the
necessary to get the sysfs changes merged.

>   if (!net || net->user_ns == _user_ns)
>   ret = init_user_ns_broadcast(env, action_string, devpath);
>   else
>   ret = user_ns_broadcast(net->uevent_sock->sk, env,
>   action_string, devpath);

Almost.

if (!net)
kobject_uevent_net_broadcast(kobj, env, action_string,
dev_path);
else
netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);


I am handwaving to get the skb in the netlink_broadcast case but that
should be enough for you to see what I am thinking.

My only concern with the above is that we almost certainly need to fix
the credentials on the skb so that userspace does not drop the packet
sent to a network namespace because it has the credentials that will
cause 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Eric W. Biederman
Christian Brauner  writes:

> On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  
>> > wrote:
>> >
>> >  Bah. This code is obviously correct and probably wrong.
>> >
>> >  How do we deliver uevents for network devices that are outside of the
>> >  initial user namespace? The kernel still needs to deliver those.
>> >
>> >  The logic to figure out which network namespace a device needs to be
>> >  delivered to is is present in kobj_bcast_filter. That logic will almost
>> >  certainly need to be turned inside out. Sign not as easy as I would
>> >  have hoped.
>> >
>> > My first patch that we discussed put additional filtering logic into 
>> > kobj_bcast_filter for that very reason. But I can move that logic
>> > out and come up with a new patch.
>> 
>> I may have mis-understood.
>> 
>> I heard and am still hearing additional filtering to reduce the places
>> the packet is delievered.
>> 
>> I am saying something needs to change to increase the number of places
>> the packet is delivered.
>> 
>> For the special class of devices that kobj_bcast_filter would apply to
>> those need to be delivered to netowrk namespaces  that are no longer on
>> uevent_sock_list.
>> 
>> So the code fundamentally needs to split into two paths.  Ordinary
>> devices that use uevent_sock_list.  Network devices that are just
>> delivered in their own network namespace.
>> 
>> netlink_broadcast_filtered gets to go away completely.
>
> The split *might* make sense but I think you're wrong about removing the
> kobj_bcast_filter. The current filter doesn't operate on the uevent
> socket in uevent_sock_list itself it rather operates on the sockets in
> mc_list. And if socket in mc_list can have a different network namespace
> then the uevent_socket itself then your way won't work. That's why my
> original patch added additional filtering in there. The way I see it we
> need something like:

We already filter the sockets in the mc_list by network namespace.

When a packet is transmitted with netlink_broadcast it is only
transmitted within a single network namespace.

Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
with it's source network namespace so no confusion will result, and the
permission checks have been done to make it safe. So you can safely
ignore that case.  Please ignore that case.  It only needs to be
considered if refactoring af_netlink.c

When I added netlink_broadcast_filtered I imagined that we would need
code that worked across network namespaces that worked for different
namespaces.   So it looked like we would need the level of granularity
that you can get with netlink_broadcast_filtered.  It turns out we don't
and that it was a case of over design.  As the only split we care about
is per network namespace there is no need for
netlink_broadcast_filtered.

> init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
>
> The question that remains is whether we can rely on the network
> namespace information we can gather from the kobject_ns_type_operations
> to decide where we want to broadcast that event to. So something
> *like*:

We can.  We already do.  That is what kobj_bcast_filter implements.

>   ops = kobj_ns_ops(kobj);
>   if (!ops && kobj->kset) {
>   struct kobject *ksobj = >kset->kobj;
>   if (ksobj->parent != NULL)
>   ops = kobj_ns_ops(ksobj->parent);
>   }
>
>   if (ops && ops->netlink_ns && kobj->ktype->namespace)
>   if (ops->type == KOBJ_NS_TYPE_NET)
>   net = kobj->ktype->namespace(kobj);

Please note the only entry in the enumeration in the kobj_ns_type
enumeration other than KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET.  So the
check for ops->type in this case is redundant.

That is something else that could be simplifed.  At the time it was the
necessary to get the sysfs changes merged.

>   if (!net || net->user_ns == _user_ns)
>   ret = init_user_ns_broadcast(env, action_string, devpath);
>   else
>   ret = user_ns_broadcast(net->uevent_sock->sk, env,
>   action_string, devpath);

Almost.

if (!net)
kobject_uevent_net_broadcast(kobj, env, action_string,
dev_path);
else
netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);


I am handwaving to get the skb in the netlink_broadcast case but that
should be enough for you to see what I am thinking.

My only concern with the above is that we almost certainly need to fix
the credentials on the skb so that userspace does not drop the packet
sent to a network namespace because it has the credentials that will
cause userspace to drop the packet today.

But it should be straight forward to look at 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Christian Brauner
On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  wrote:
> >
> >  Bah. This code is obviously correct and probably wrong.
> >
> >  How do we deliver uevents for network devices that are outside of the
> >  initial user namespace? The kernel still needs to deliver those.
> >
> >  The logic to figure out which network namespace a device needs to be
> >  delivered to is is present in kobj_bcast_filter. That logic will almost
> >  certainly need to be turned inside out. Sign not as easy as I would
> >  have hoped.
> >
> > My first patch that we discussed put additional filtering logic into 
> > kobj_bcast_filter for that very reason. But I can move that logic
> > out and come up with a new patch.
> 
> I may have mis-understood.
> 
> I heard and am still hearing additional filtering to reduce the places
> the packet is delievered.
> 
> I am saying something needs to change to increase the number of places
> the packet is delivered.
> 
> For the special class of devices that kobj_bcast_filter would apply to
> those need to be delivered to netowrk namespaces  that are no longer on
> uevent_sock_list.
> 
> So the code fundamentally needs to split into two paths.  Ordinary
> devices that use uevent_sock_list.  Network devices that are just
> delivered in their own network namespace.
> 
> netlink_broadcast_filtered gets to go away completely.

The split *might* make sense but I think you're wrong about removing the
kobj_bcast_filter. The current filter doesn't operate on the uevent
socket in uevent_sock_list itself it rather operates on the sockets in
mc_list. And if socket in mc_list can have a different network namespace
then the uevent_socket itself then your way won't work. That's why my
original patch added additional filtering in there. The way I see it we
need something like:

init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);

The question that remains is whether we can rely on the network
namespace information we can gather from the kobject_ns_type_operations
to decide where we want to broadcast that event to. So something *like*:

ops = kobj_ns_ops(kobj);
if (!ops && kobj->kset) {
struct kobject *ksobj = >kset->kobj;
if (ksobj->parent != NULL)
ops = kobj_ns_ops(ksobj->parent);
}

if (ops && ops->netlink_ns && kobj->ktype->namespace)
if (ops->type == KOBJ_NS_TYPE_NET)
net = kobj->ktype->namespace(kobj);

if (!net || net->user_ns == _user_ns)
ret = init_user_ns_broadcast(env, action_string, devpath);
else
ret = user_ns_broadcast(net->uevent_sock->sk, env,
action_string, devpath);

Christian

> The logic of figuring out the network namespace though becomes trickier.
> 
> Now it may make sense to have all of that as an additional patch on top
> of this one or perhaps a precursor patch that addresses the problem.  We
> will unfortunately drop those uevents today because their uids are not
> valid.  But they are not delivered anywhere else so to allow them to be
> received we need to fix them.
> 
> Eric
> 
> >
> >  Christian Brauner  writes:
> >  > commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> > namespaces")
> >  >
> >  > enabled sending hotplug events into all network namespaces back in 2010.
> >  > Over time the set of uevents that get sent into all network namespaces 
> > has
> >  > shrunk a little. We have now reached the point where hotplug events for 
> > all
> >  > devices that carry a namespace tag are filtered according to that
> >  > namespace. Specifically, they are filtered whenever the namespace tag of
> >  > the kobject does not match the namespace tag of the netlink socket. One
> >  > example are network devices. Uevents for network devices only show up in
> >  > the network namespaces these devices are moved to or created in.
> >  >
> >  > However, any uevent for a kobject that does not have a namespace tag
> >  > associated with it will not be filtered and we will broadcast it into all
> >  > network namespaces. This behavior stopped making sense when user 
> > namespaces
> >  > were introduced.
> >  >
> >  > This patch restricts uevents to the initial user namespace for a couple 
> > of
> >  > reasons that have been extensively discusses on the mailing list [1].
> >  > - Thundering herd:
> >  > Broadcasting uevents into all network namespaces introduces significant
> >  > overhead.
> >  > All processes that listen to uevents running in non-initial user
> >  > namespaces will end up responding to uevents that will be meaningless to
> >  > them. Mainly, because non-initial user namespaces cannot 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-26 Thread Christian Brauner
On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  wrote:
> >
> >  Bah. This code is obviously correct and probably wrong.
> >
> >  How do we deliver uevents for network devices that are outside of the
> >  initial user namespace? The kernel still needs to deliver those.
> >
> >  The logic to figure out which network namespace a device needs to be
> >  delivered to is is present in kobj_bcast_filter. That logic will almost
> >  certainly need to be turned inside out. Sign not as easy as I would
> >  have hoped.
> >
> > My first patch that we discussed put additional filtering logic into 
> > kobj_bcast_filter for that very reason. But I can move that logic
> > out and come up with a new patch.
> 
> I may have mis-understood.
> 
> I heard and am still hearing additional filtering to reduce the places
> the packet is delievered.
> 
> I am saying something needs to change to increase the number of places
> the packet is delivered.
> 
> For the special class of devices that kobj_bcast_filter would apply to
> those need to be delivered to netowrk namespaces  that are no longer on
> uevent_sock_list.
> 
> So the code fundamentally needs to split into two paths.  Ordinary
> devices that use uevent_sock_list.  Network devices that are just
> delivered in their own network namespace.
> 
> netlink_broadcast_filtered gets to go away completely.

The split *might* make sense but I think you're wrong about removing the
kobj_bcast_filter. The current filter doesn't operate on the uevent
socket in uevent_sock_list itself it rather operates on the sockets in
mc_list. And if socket in mc_list can have a different network namespace
then the uevent_socket itself then your way won't work. That's why my
original patch added additional filtering in there. The way I see it we
need something like:

init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);

The question that remains is whether we can rely on the network
namespace information we can gather from the kobject_ns_type_operations
to decide where we want to broadcast that event to. So something *like*:

ops = kobj_ns_ops(kobj);
if (!ops && kobj->kset) {
struct kobject *ksobj = >kset->kobj;
if (ksobj->parent != NULL)
ops = kobj_ns_ops(ksobj->parent);
}

if (ops && ops->netlink_ns && kobj->ktype->namespace)
if (ops->type == KOBJ_NS_TYPE_NET)
net = kobj->ktype->namespace(kobj);

if (!net || net->user_ns == _user_ns)
ret = init_user_ns_broadcast(env, action_string, devpath);
else
ret = user_ns_broadcast(net->uevent_sock->sk, env,
action_string, devpath);

Christian

> The logic of figuring out the network namespace though becomes trickier.
> 
> Now it may make sense to have all of that as an additional patch on top
> of this one or perhaps a precursor patch that addresses the problem.  We
> will unfortunately drop those uevents today because their uids are not
> valid.  But they are not delivered anywhere else so to allow them to be
> received we need to fix them.
> 
> Eric
> 
> >
> >  Christian Brauner  writes:
> >  > commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> > namespaces")
> >  >
> >  > enabled sending hotplug events into all network namespaces back in 2010.
> >  > Over time the set of uevents that get sent into all network namespaces 
> > has
> >  > shrunk a little. We have now reached the point where hotplug events for 
> > all
> >  > devices that carry a namespace tag are filtered according to that
> >  > namespace. Specifically, they are filtered whenever the namespace tag of
> >  > the kobject does not match the namespace tag of the netlink socket. One
> >  > example are network devices. Uevents for network devices only show up in
> >  > the network namespaces these devices are moved to or created in.
> >  >
> >  > However, any uevent for a kobject that does not have a namespace tag
> >  > associated with it will not be filtered and we will broadcast it into all
> >  > network namespaces. This behavior stopped making sense when user 
> > namespaces
> >  > were introduced.
> >  >
> >  > This patch restricts uevents to the initial user namespace for a couple 
> > of
> >  > reasons that have been extensively discusses on the mailing list [1].
> >  > - Thundering herd:
> >  > Broadcasting uevents into all network namespaces introduces significant
> >  > overhead.
> >  > All processes that listen to uevents running in non-initial user
> >  > namespaces will end up responding to uevents that will be meaningless to
> >  > them. Mainly, because non-initial user namespaces cannot easily manage
> >  > devices unless they have a privileged host-process helping them 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman
Christian Brauner  writes:

> On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  wrote:
>
>  Bah. This code is obviously correct and probably wrong.
>
>  How do we deliver uevents for network devices that are outside of the
>  initial user namespace? The kernel still needs to deliver those.
>
>  The logic to figure out which network namespace a device needs to be
>  delivered to is is present in kobj_bcast_filter. That logic will almost
>  certainly need to be turned inside out. Sign not as easy as I would
>  have hoped.
>
> My first patch that we discussed put additional filtering logic into 
> kobj_bcast_filter for that very reason. But I can move that logic
> out and come up with a new patch.

I may have mis-understood.

I heard and am still hearing additional filtering to reduce the places
the packet is delievered.

I am saying something needs to change to increase the number of places
the packet is delivered.

For the special class of devices that kobj_bcast_filter would apply to
those need to be delivered to netowrk namespaces  that are no longer on
uevent_sock_list.

So the code fundamentally needs to split into two paths.  Ordinary
devices that use uevent_sock_list.  Network devices that are just
delivered in their own network namespace.

netlink_broadcast_filtered gets to go away completely.
The logic of figuring out the network namespace though becomes trickier.

Now it may make sense to have all of that as an additional patch on top
of this one or perhaps a precursor patch that addresses the problem.  We
will unfortunately drop those uevents today because their uids are not
valid.  But they are not delivered anywhere else so to allow them to be
received we need to fix them.

Eric

>
>  Christian Brauner  writes:
>  > commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> namespaces")
>  >
>  > enabled sending hotplug events into all network namespaces back in 2010.
>  > Over time the set of uevents that get sent into all network namespaces has
>  > shrunk a little. We have now reached the point where hotplug events for all
>  > devices that carry a namespace tag are filtered according to that
>  > namespace. Specifically, they are filtered whenever the namespace tag of
>  > the kobject does not match the namespace tag of the netlink socket. One
>  > example are network devices. Uevents for network devices only show up in
>  > the network namespaces these devices are moved to or created in.
>  >
>  > However, any uevent for a kobject that does not have a namespace tag
>  > associated with it will not be filtered and we will broadcast it into all
>  > network namespaces. This behavior stopped making sense when user namespaces
>  > were introduced.
>  >
>  > This patch restricts uevents to the initial user namespace for a couple of
>  > reasons that have been extensively discusses on the mailing list [1].
>  > - Thundering herd:
>  > Broadcasting uevents into all network namespaces introduces significant
>  > overhead.
>  > All processes that listen to uevents running in non-initial user
>  > namespaces will end up responding to uevents that will be meaningless to
>  > them. Mainly, because non-initial user namespaces cannot easily manage
>  > devices unless they have a privileged host-process helping them out. This
>  > means that there will be a thundering herd of activity when there
>  > shouldn't be any.
>  > - Uevents from non-root users are already filtered in userspace:
>  > Uevents are filtered by userspace in a user namespace because the
>  > received uid != 0. Instead the uid associated with the event will be
>  > 65534 == "nobody" because the global root uid is not mapped.
>  > This means we can safely and without introducing regressions modify the
>  > kernel to not send uevents into all network namespaces whose owning user
>  > namespace is not the initial user namespace because we know that
>  > userspace will ignore the message because of the uid anyway. I have
>  > a) verified that is is true for every udev implementation out there b)
>  > that this behavior has been present in all udev implementations from the
>  > very beginning.
>  > - Removing needless overhead/Increasing performance:
>  > Currently, the uevent socket for each network namespace is added to the
>  > global variable uevent_sock_list. The list itself needs to be protected
>  > by a mutex. So everytime a uevent is generated the mutex is taken on the
>  > list. The mutex is held *from the creation of the uevent (memory
>  > allocation, string creation etc. until all uevent sockets have been
>  > handled*. This is aggravated by the fact that for each uevent socket that
>  > has listeners the mc_list must be walked as well which means we're
>  > talking O(n^2) here. Given that a standard Linux workload usually has
>  > quite a lot of network namespaces and - in the face of containers - a lot
>  > of user namespaces this 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman
Christian Brauner  writes:

> On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  wrote:
>
>  Bah. This code is obviously correct and probably wrong.
>
>  How do we deliver uevents for network devices that are outside of the
>  initial user namespace? The kernel still needs to deliver those.
>
>  The logic to figure out which network namespace a device needs to be
>  delivered to is is present in kobj_bcast_filter. That logic will almost
>  certainly need to be turned inside out. Sign not as easy as I would
>  have hoped.
>
> My first patch that we discussed put additional filtering logic into 
> kobj_bcast_filter for that very reason. But I can move that logic
> out and come up with a new patch.

I may have mis-understood.

I heard and am still hearing additional filtering to reduce the places
the packet is delievered.

I am saying something needs to change to increase the number of places
the packet is delivered.

For the special class of devices that kobj_bcast_filter would apply to
those need to be delivered to netowrk namespaces  that are no longer on
uevent_sock_list.

So the code fundamentally needs to split into two paths.  Ordinary
devices that use uevent_sock_list.  Network devices that are just
delivered in their own network namespace.

netlink_broadcast_filtered gets to go away completely.
The logic of figuring out the network namespace though becomes trickier.

Now it may make sense to have all of that as an additional patch on top
of this one or perhaps a precursor patch that addresses the problem.  We
will unfortunately drop those uevents today because their uids are not
valid.  But they are not delivered anywhere else so to allow them to be
received we need to fix them.

Eric

>
>  Christian Brauner  writes:
>  > commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> namespaces")
>  >
>  > enabled sending hotplug events into all network namespaces back in 2010.
>  > Over time the set of uevents that get sent into all network namespaces has
>  > shrunk a little. We have now reached the point where hotplug events for all
>  > devices that carry a namespace tag are filtered according to that
>  > namespace. Specifically, they are filtered whenever the namespace tag of
>  > the kobject does not match the namespace tag of the netlink socket. One
>  > example are network devices. Uevents for network devices only show up in
>  > the network namespaces these devices are moved to or created in.
>  >
>  > However, any uevent for a kobject that does not have a namespace tag
>  > associated with it will not be filtered and we will broadcast it into all
>  > network namespaces. This behavior stopped making sense when user namespaces
>  > were introduced.
>  >
>  > This patch restricts uevents to the initial user namespace for a couple of
>  > reasons that have been extensively discusses on the mailing list [1].
>  > - Thundering herd:
>  > Broadcasting uevents into all network namespaces introduces significant
>  > overhead.
>  > All processes that listen to uevents running in non-initial user
>  > namespaces will end up responding to uevents that will be meaningless to
>  > them. Mainly, because non-initial user namespaces cannot easily manage
>  > devices unless they have a privileged host-process helping them out. This
>  > means that there will be a thundering herd of activity when there
>  > shouldn't be any.
>  > - Uevents from non-root users are already filtered in userspace:
>  > Uevents are filtered by userspace in a user namespace because the
>  > received uid != 0. Instead the uid associated with the event will be
>  > 65534 == "nobody" because the global root uid is not mapped.
>  > This means we can safely and without introducing regressions modify the
>  > kernel to not send uevents into all network namespaces whose owning user
>  > namespace is not the initial user namespace because we know that
>  > userspace will ignore the message because of the uid anyway. I have
>  > a) verified that is is true for every udev implementation out there b)
>  > that this behavior has been present in all udev implementations from the
>  > very beginning.
>  > - Removing needless overhead/Increasing performance:
>  > Currently, the uevent socket for each network namespace is added to the
>  > global variable uevent_sock_list. The list itself needs to be protected
>  > by a mutex. So everytime a uevent is generated the mutex is taken on the
>  > list. The mutex is held *from the creation of the uevent (memory
>  > allocation, string creation etc. until all uevent sockets have been
>  > handled*. This is aggravated by the fact that for each uevent socket that
>  > has listeners the mc_list must be walked as well which means we're
>  > talking O(n^2) here. Given that a standard Linux workload usually has
>  > quite a lot of network namespaces and - in the face of containers - a lot
>  > of user namespaces this quickly becomes a performance problem (see
>  > "Thundering herd" above). By just 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Christian Brauner
On Tue, Apr 24, 2018 at 05:40:07PM -0500, Eric W. Biederman wrote:
> 
> Bah.  This code is obviously correct and probably wrong.
> 
> How do we deliver uevents for network devices that are outside of the
> initial user namespace?  The kernel still needs to deliver those.
> 
> The logic to figure out which network namespace a device needs to be
> delivered to is is present in kobj_bcast_filter.  That logic will almost
> certainly need to be turned inside out.  Sign not as easy as I would
> have hoped.

That's why my initial patch [1] added additional filtering logic to
kobj_bcast_filter(). But since we care about performance improvements as
well I can come up with a patch that moves this logic out of
kobj_bcast_filter().

Christian
[1]: https://www.spinics.net/lists/netdev/msg494487.html

> 
> Eric
> 
> Christian Brauner  writes:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> > namespaces")
> >
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk a little. We have now reached the point where hotplug events for all
> > devices that carry a namespace tag are filtered according to that
> > namespace. Specifically, they are filtered whenever the namespace tag of
> > the kobject does not match the namespace tag of the netlink socket. One
> > example are network devices. Uevents for network devices only show up in
> > the network namespaces these devices are moved to or created in.
> >
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will broadcast it into all
> > network namespaces. This behavior stopped making sense when user namespaces
> > were introduced.
> >
> > This patch restricts uevents to the initial user namespace for a couple of
> > reasons that have been extensively discusses on the mailing list [1].
> > - Thundering herd:
> >   Broadcasting uevents into all network namespaces introduces significant
> >   overhead.
> >   All processes that listen to uevents running in non-initial user
> >   namespaces will end up responding to uevents that will be meaningless to
> >   them. Mainly, because non-initial user namespaces cannot easily manage
> >   devices unless they have a privileged host-process helping them out. This
> >   means that there will be a thundering herd of activity when there
> >   shouldn't be any.
> > - Uevents from non-root users are already filtered in userspace:
> >   Uevents are filtered by userspace in a user namespace because the
> >   received uid != 0. Instead the uid associated with the event will be
> >   65534 == "nobody" because the global root uid is not mapped.
> >   This means we can safely and without introducing regressions modify the
> >   kernel to not send uevents into all network namespaces whose owning user
> >   namespace is not the initial user namespace because we know that
> >   userspace will ignore the message because of the uid anyway. I have
> >   a) verified that is is true for every udev implementation out there b)
> >   that this behavior has been present in all udev implementations from the
> >   very beginning.
> > - Removing needless overhead/Increasing performance:
> >   Currently, the uevent socket for each network namespace is added to the
> >   global variable uevent_sock_list. The list itself needs to be protected
> >   by a mutex. So everytime a uevent is generated the mutex is taken on the
> >   list. The mutex is held *from the creation of the uevent (memory
> >   allocation, string creation etc. until all uevent sockets have been
> >   handled*. This is aggravated by the fact that for each uevent socket that
> >   has listeners the mc_list must be walked as well which means we're
> >   talking O(n^2) here. Given that a standard Linux workload usually has
> >   quite a lot of network namespaces and - in the face of containers - a lot
> >   of user namespaces this quickly becomes a performance problem (see
> >   "Thundering herd" above). By just recording uevent sockets of network
> >   namespaces that are owned by the initial user namespace we significantly
> >   increase performance in this codepath.
> > - Injecting uevents:
> >   There's a valid argument that containers might be interested in receiving
> >   device events especially if they are delegated to them by a privileged
> >   userspace process. One prime example are SR-IOV enabled devices that are
> >   explicitly designed to be handed of to other users such as VMs or
> >   containers.
> >   This use-case can now be correctly handled since
> >   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> >   introduced the ability to send uevents from userspace. As such we can let
> >   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> >   the network namespace of the netlink socket) userspace process make a
> > 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Christian Brauner
On Tue, Apr 24, 2018 at 05:40:07PM -0500, Eric W. Biederman wrote:
> 
> Bah.  This code is obviously correct and probably wrong.
> 
> How do we deliver uevents for network devices that are outside of the
> initial user namespace?  The kernel still needs to deliver those.
> 
> The logic to figure out which network namespace a device needs to be
> delivered to is is present in kobj_bcast_filter.  That logic will almost
> certainly need to be turned inside out.  Sign not as easy as I would
> have hoped.

That's why my initial patch [1] added additional filtering logic to
kobj_bcast_filter(). But since we care about performance improvements as
well I can come up with a patch that moves this logic out of
kobj_bcast_filter().

Christian
[1]: https://www.spinics.net/lists/netdev/msg494487.html

> 
> Eric
> 
> Christian Brauner  writes:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> > namespaces")
> >
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk a little. We have now reached the point where hotplug events for all
> > devices that carry a namespace tag are filtered according to that
> > namespace. Specifically, they are filtered whenever the namespace tag of
> > the kobject does not match the namespace tag of the netlink socket. One
> > example are network devices. Uevents for network devices only show up in
> > the network namespaces these devices are moved to or created in.
> >
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will broadcast it into all
> > network namespaces. This behavior stopped making sense when user namespaces
> > were introduced.
> >
> > This patch restricts uevents to the initial user namespace for a couple of
> > reasons that have been extensively discusses on the mailing list [1].
> > - Thundering herd:
> >   Broadcasting uevents into all network namespaces introduces significant
> >   overhead.
> >   All processes that listen to uevents running in non-initial user
> >   namespaces will end up responding to uevents that will be meaningless to
> >   them. Mainly, because non-initial user namespaces cannot easily manage
> >   devices unless they have a privileged host-process helping them out. This
> >   means that there will be a thundering herd of activity when there
> >   shouldn't be any.
> > - Uevents from non-root users are already filtered in userspace:
> >   Uevents are filtered by userspace in a user namespace because the
> >   received uid != 0. Instead the uid associated with the event will be
> >   65534 == "nobody" because the global root uid is not mapped.
> >   This means we can safely and without introducing regressions modify the
> >   kernel to not send uevents into all network namespaces whose owning user
> >   namespace is not the initial user namespace because we know that
> >   userspace will ignore the message because of the uid anyway. I have
> >   a) verified that is is true for every udev implementation out there b)
> >   that this behavior has been present in all udev implementations from the
> >   very beginning.
> > - Removing needless overhead/Increasing performance:
> >   Currently, the uevent socket for each network namespace is added to the
> >   global variable uevent_sock_list. The list itself needs to be protected
> >   by a mutex. So everytime a uevent is generated the mutex is taken on the
> >   list. The mutex is held *from the creation of the uevent (memory
> >   allocation, string creation etc. until all uevent sockets have been
> >   handled*. This is aggravated by the fact that for each uevent socket that
> >   has listeners the mc_list must be walked as well which means we're
> >   talking O(n^2) here. Given that a standard Linux workload usually has
> >   quite a lot of network namespaces and - in the face of containers - a lot
> >   of user namespaces this quickly becomes a performance problem (see
> >   "Thundering herd" above). By just recording uevent sockets of network
> >   namespaces that are owned by the initial user namespace we significantly
> >   increase performance in this codepath.
> > - Injecting uevents:
> >   There's a valid argument that containers might be interested in receiving
> >   device events especially if they are delegated to them by a privileged
> >   userspace process. One prime example are SR-IOV enabled devices that are
> >   explicitly designed to be handed of to other users such as VMs or
> >   containers.
> >   This use-case can now be correctly handled since
> >   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> >   introduced the ability to send uevents from userspace. As such we can let
> >   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> >   the network namespace of the netlink socket) userspace process make a
> >   decision what uevents 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman

Bah.  This code is obviously correct and probably wrong.

How do we deliver uevents for network devices that are outside of the
initial user namespace?  The kernel still needs to deliver those.

The logic to figure out which network namespace a device needs to be
delivered to is is present in kobj_bcast_filter.  That logic will almost
certainly need to be turned inside out.  Sign not as easy as I would
have hoped.

Eric

Christian Brauner  writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discusses on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that is is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So everytime a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation etc. until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed of to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman

Bah.  This code is obviously correct and probably wrong.

How do we deliver uevents for network devices that are outside of the
initial user namespace?  The kernel still needs to deliver those.

The logic to figure out which network namespace a device needs to be
delivered to is is present in kobj_bcast_filter.  That logic will almost
certainly need to be turned inside out.  Sign not as easy as I would
have hoped.

Eric

Christian Brauner  writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discusses on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that is is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So everytime a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation etc. until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed of to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that is opening the uevent socket 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman

We already do this in practice in userspace.  It doesn't make much
sense to perform this delivery.  So we might as well make this optimization.

Christian Brauner  writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discusses on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that is is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So everytime a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation etc. until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed of to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that is opening the uevent socket in that network namespace can still
>   listen to uevents.
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> Signed-off-by: Christian Brauner 
Acked-by: "Eric W. Biederman" 

> ---
> Changelog v1->v2:

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman

We already do this in practice in userspace.  It doesn't make much
sense to perform this delivery.  So we might as well make this optimization.

Christian Brauner  writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discusses on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that is is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So everytime a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation etc. until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed of to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that is opening the uevent socket in that network namespace can still
>   listen to uevents.
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> Signed-off-by: Christian Brauner 
Acked-by: "Eric W. Biederman" 

> ---
> Changelog v1->v2:
> * patch unchanged
> Changelog v0->v1:
> * patch unchanged
> ---
>  

[PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Christian Brauner
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discusses on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes that listen to uevents running in non-initial user
  namespaces will end up responding to uevents that will be meaningless to
  them. Mainly, because non-initial user namespaces cannot easily manage
  devices unless they have a privileged host-process helping them out. This
  means that there will be a thundering herd of activity when there
  shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into all network namespaces whose owning user
  namespace is not the initial user namespace because we know that
  userspace will ignore the message because of the uid anyway. I have
  a) verified that is is true for every udev implementation out there b)
  that this behavior has been present in all udev implementations from the
  very beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So everytime a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc. until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example are SR-IOV enabled devices that are
  explicitly designed to be handed of to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process make a
  decision what uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means if the user namespace of a given task is unshared but the network
  namespace is kept and is owned by the initial user namespace a listener
  that is opening the uevent socket in that network namespace can still
  listen to uevents.

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner 
---
Changelog v1->v2:
* patch unchanged
Changelog v0->v1:
* patch unchanged
---
 lib/kobject_uevent.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
net->uevent_sock = ue_sk;
 
- 

[PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Christian Brauner
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discusses on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes that listen to uevents running in non-initial user
  namespaces will end up responding to uevents that will be meaningless to
  them. Mainly, because non-initial user namespaces cannot easily manage
  devices unless they have a privileged host-process helping them out. This
  means that there will be a thundering herd of activity when there
  shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into all network namespaces whose owning user
  namespace is not the initial user namespace because we know that
  userspace will ignore the message because of the uid anyway. I have
  a) verified that is is true for every udev implementation out there b)
  that this behavior has been present in all udev implementations from the
  very beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So everytime a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc. until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example are SR-IOV enabled devices that are
  explicitly designed to be handed of to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process make a
  decision what uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means if the user namespace of a given task is unshared but the network
  namespace is kept and is owned by the initial user namespace a listener
  that is opening the uevent socket in that network namespace can still
  listen to uevents.

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner 
---
Changelog v1->v2:
* patch unchanged
Changelog v0->v1:
* patch unchanged
---
 lib/kobject_uevent.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
net->uevent_sock = ue_sk;
 
-