Re: [PATCH net-next v3] net-loopback: set lo dev initial state to UP

2021-02-09 Thread महेश बंडेवार
On Tue, Feb 9, 2021 at 11:06 AM Ido Schimmel  wrote:
>
> On Tue, Feb 09, 2021 at 10:49:23AM -0800, Mahesh Bandewar (महेश बंडेवार) 
> wrote:
> > On Tue, Feb 9, 2021 at 8:23 AM Jakub Kicinski  wrote:
> > >
> > > On Tue, 9 Feb 2021 12:54:59 +0100 Petr Machata wrote:
> > > > Jian Yang  writes:
> > > >
> > > > > From: Jian Yang 
> > > > >
> > > > > Traditionally loopback devices come up with initial state as DOWN for
> > > > > any new network-namespace. This would mean that anyone needing this
> > > > > device would have to bring this UP by issuing something like 'ip link
> > > > > set lo up'. This can be avoided if the initial state is set as UP.
> > > >
> > > > This will break user scripts, and in fact breaks the kernel's very own
> > > > selftest. We currently have this internally:
> > > >
> > > > diff --git a/tools/testing/selftests/net/fib_nexthops.sh 
> > > > b/tools/testing/selftests/net/fib_nexthops.sh
> > > > index 4c7d33618437..bf8ed24ab3ba 100755
> > > > --- a/tools/testing/selftests/net/fib_nexthops.sh
> > > > +++ b/tools/testing/selftests/net/fib_nexthops.sh
> > > > @@ -121,8 +121,6 @@ create_ns()
> > > >   set -e
> > > >   ip netns add ${n}
> > > >   ip netns set ${n} $((nsid++))
> > > > - ip -netns ${n} addr add 127.0.0.1/8 dev lo
> > > > - ip -netns ${n} link set lo up
> > > >
> > > >   ip netns exec ${n} sysctl -qw net.ipv4.ip_forward=1
> > > >   ip netns exec ${n} sysctl -qw net.ipv4.fib_multipath_use_neigh=1
> > > >
> > > > This now fails because the ip commands are run within a "set -e" block,
> > > > and kernel rejects addition of a duplicate address.
> > >
> > > Thanks for the report, could you send a revert with this explanation?
> > Rather than revert, shouldn't we just fix the self-test in that regard?
>
> I reviewed such a patch internally and asked Petr to report it as a
> regression instead. At the time the new behavior was added under a
> sysctl, but nobody had examples for behavior that will break, so the
> sysctl was removed. Now we have such an example, so the revert / sysctl
> are needed.

OK, in that case I would prefer to send an incremental patch to
enclose the new behavior with the sysctl (proposed earlier) rather
than the revert. Would that help?
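
Concretely, the incremental patch would just re-add the guard from the earlier
submission around the loopback bring-up; this is a sketch only, and the sysctl
name from the v1 posting could of course change:

	if (sysctl_netdev_loopback_state) {
		/* Bring loopback UP only when the admin opted in; the
		 * default of 0 keeps today's DOWN-on-create behavior. */
		rtnl_lock();
		dev_open(dev, NULL);
		rtnl_unlock();
	}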


Re: [PATCH net-next v3] net-loopback: set lo dev initial state to UP

2021-02-09 Thread महेश बंडेवार
On Tue, Feb 9, 2021 at 11:04 AM Jakub Kicinski  wrote:
>
> On Tue, 9 Feb 2021 10:49:23 -0800 Mahesh Bandewar (महेश बंडेवार) wrote:
> > On Tue, Feb 9, 2021 at 8:23 AM Jakub Kicinski  wrote:
> > > On Tue, 9 Feb 2021 12:54:59 +0100 Petr Machata wrote:
> > > > This will break user scripts, and in fact breaks the kernel's very own
> > > > selftest. We currently have this internally:
> > > >
> > > > diff --git a/tools/testing/selftests/net/fib_nexthops.sh 
> > > > b/tools/testing/selftests/net/fib_nexthops.sh
> > > > index 4c7d33618437..bf8ed24ab3ba 100755
> > > > --- a/tools/testing/selftests/net/fib_nexthops.sh
> > > > +++ b/tools/testing/selftests/net/fib_nexthops.sh
> > > > @@ -121,8 +121,6 @@ create_ns()
> > > >   set -e
> > > >   ip netns add ${n}
> > > >   ip netns set ${n} $((nsid++))
> > > > - ip -netns ${n} addr add 127.0.0.1/8 dev lo
> > > > - ip -netns ${n} link set lo up
> > > >
> > > >   ip netns exec ${n} sysctl -qw net.ipv4.ip_forward=1
> > > >   ip netns exec ${n} sysctl -qw net.ipv4.fib_multipath_use_neigh=1
> > > >
> > > > This now fails because the ip commands are run within a "set -e" block,
> > > > and kernel rejects addition of a duplicate address.
> > >
> > > Thanks for the report, could you send a revert with this explanation?
> > Rather than revert, shouldn't we just fix the self-test in that regard?
>
> The selftest is just a messenger. We all know Linus's stand on
> regressions, IMO we can't make an honest argument that the change
> does not break user space after it broke our own selftest. Maybe
> others disagree..

Actually that was the reason behind encompassing this behavior change
with a sysctl.


Re: [PATCH net-next v3] net-loopback: set lo dev initial state to UP

2021-02-09 Thread महेश बंडेवार
On Tue, Feb 9, 2021 at 8:23 AM Jakub Kicinski  wrote:
>
> On Tue, 9 Feb 2021 12:54:59 +0100 Petr Machata wrote:
> > Jian Yang  writes:
> >
> > > From: Jian Yang 
> > >
> > > Traditionally loopback devices come up with initial state as DOWN for
> > > any new network-namespace. This would mean that anyone needing this
> > > device would have to bring this UP by issuing something like 'ip link
> > > set lo up'. This can be avoided if the initial state is set as UP.
> >
> > This will break user scripts, and in fact breaks the kernel's very own
> > selftest. We currently have this internally:
> >
> > diff --git a/tools/testing/selftests/net/fib_nexthops.sh 
> > b/tools/testing/selftests/net/fib_nexthops.sh
> > index 4c7d33618437..bf8ed24ab3ba 100755
> > --- a/tools/testing/selftests/net/fib_nexthops.sh
> > +++ b/tools/testing/selftests/net/fib_nexthops.sh
> > @@ -121,8 +121,6 @@ create_ns()
> >   set -e
> >   ip netns add ${n}
> >   ip netns set ${n} $((nsid++))
> > - ip -netns ${n} addr add 127.0.0.1/8 dev lo
> > - ip -netns ${n} link set lo up
> >
> >   ip netns exec ${n} sysctl -qw net.ipv4.ip_forward=1
> >   ip netns exec ${n} sysctl -qw net.ipv4.fib_multipath_use_neigh=1
> >
> > This now fails because the ip commands are run within a "set -e" block,
> > and kernel rejects addition of a duplicate address.
>
> Thanks for the report, could you send a revert with this explanation?
Rather than revert, shouldn't we just fix the self-test in that regard?


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-12-02 Thread महेश बंडेवार
On Tue, Dec 1, 2020 at 6:38 PM Jakub Kicinski  wrote:
>
> On Tue, 1 Dec 2020 12:24:38 -0800 Mahesh Bandewar (महेश बंडेवार) wrote:
> > On Thu, Nov 19, 2020 at 8:56 PM Jakub Kicinski  wrote:
> > > Do you have more details on what the use cases are that expect no
> > > networking?
> > >
> > > TBH I don't get the utility of this knob. If you want to write vaguely
> > > portable software you have to assume the knob won't be useful, because
> > > either (a) kernel may be old, or (b) you shouldn't assume to own the
> > > sysctls and this is a global one (what if an application spawns that
> > > expects legacy behavior?)
> > >
> > > And if you have to check for those two things you're gonna write more
> > > code than just ifuping lo would be.
> > >
> > > Maybe you can shed some more light on how it makes life at Google
> > > easier for you? Or someone else can enlighten me?
> > >
> > > I don't have much practical experience with namespaces, but the more
> > > I think about it the more pointless it seems.
> >
> > Thanks for the input.
> >
> > Sorry, I was on vacation and couldn't process this response earlier.
> >
> > This patch provides two things; let me make sure I understand the
> > objections correctly.
> >
> > (a) Bring up lo by default
> > (b) sysctl to protect the legacy behavior
> >
> > Frankly, we really don't have any legacy-behavior use case, and one can
> > bring it back to life by just doing 'ifdown lo' if necessary.
>
> If use cases depending on legacy behavior exist we are just trading the
> ifup in one case for an ifdown in another.
>
Yes, I would agree with this if the use cases were split 50/50, but I would
say it's more like a 99/1 distribution: 99% of the time (if not more) lo is
required to be UP, and perhaps 1% of the time or less it needs to stay down.

> Unless we can dispel the notion that sand-boxing by lo down is a
> legitimate use case IMO we're just adding complexity and growing
> a sysctl for something that can be trivially handled from user space.
>
OK, I can remove the sysctl and just have the 3 line patch.



> > Most of
> > the use cases involve networking anyway. My belief was that we
> > need to protect legacy behavior, and hence I went to the length of adding
> > a sysctl to protect it. If we are OK not to have it, I'm more than happy to
> > remove the sysctl and just have the 3 line patch to bring loopback up.
> >
> > If legacy behavior is a concern (which I thought it generally is), we can
> > either keep the sysctl around to protect it (the current implementation),
> > or, if a kernel command-line option is preferred over a sysctl, default to
> > the legacy behavior and bring lo UP at boot only when the option is
> > provided (or the other way around)?
> >
> > My primary motive is (a), while (b) is just a side effect that we can
> > drop if deemed unnecessary.
>


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-12-01 Thread महेश बंडेवार
On Thu, Nov 19, 2020 at 8:56 PM Jakub Kicinski  wrote:
>
> On Thu, 19 Nov 2020 19:55:08 -0800 Mahesh Bandewar (महेश बंडेवार) wrote:
> > On Thu, Nov 19, 2020 at 12:03 AM Nicolas Dichtel
> > > On 18/11/2020 at 18:39, Mahesh Bandewar (महेश बंडेवार) wrote:
> > > > netns but would create problems for workloads that create netns to
> > > > disable networking. One can always disable it after creating the netns
> > > > but that would mean change in the workflow and it could be viewed as
> > > > regression.
> > > The networking is very limited with only a loopback. Do you have some 
> > > real use
> > > case in mind?
> >
> > My use cases all use networking but I think principally we cannot
> > break backward compatibility, right?
> > Jakub, WDYT?
>
> Do you have more details on what the use cases are that expect no
> networking?
>
> TBH I don't get the utility of this knob. If you want to write vaguely
> portable software you have to assume the knob won't be useful, because
> either (a) kernel may be old, or (b) you shouldn't assume to own the
> sysctls and this is a global one (what if an application spawns that
> expects legacy behavior?)
>
> And if you have to check for those two things you're gonna write more
> code than just ifuping lo would be.
>
> Maybe you can shed some more light on how it makes life at Google
> easier for you? Or someone else can enlighten me?
>
> I don't have much practical experience with namespaces, but the more
> I think about it the more pointless it seems.

Thanks for the input.

Sorry, I was on vacation and couldn't process this response earlier.

This patch provides two things; let me make sure I understand the
objections correctly.

(a) Bring up lo by default
(b) sysctl to protect the legacy behavior

Frankly, we really don't have any legacy-behavior use case, and one can
bring it back to life by just doing 'ifdown lo' if necessary. Most of
the use cases involve networking anyway. My belief was that we need to
protect legacy behavior, and hence I went to the length of adding a
sysctl to protect it. If we are OK not to have it, I'm more than happy to
remove the sysctl and just have the 3-line patch to bring loopback up.

If legacy behavior is a concern (which I thought it generally is), we can
either keep the sysctl around to protect it (the current implementation),
or, if a kernel command-line option is preferred over a sysctl, default to
the legacy behavior and bring lo UP at boot only when the option is
provided (or the other way around)?

My primary motive is (a), while (b) is just a side effect that we can
drop if deemed unnecessary.


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-19 Thread महेश बंडेवार
On Thu, Nov 19, 2020 at 12:03 AM Nicolas Dichtel
 wrote:
>
> > On 18/11/2020 at 18:39, Mahesh Bandewar (महेश बंडेवार) wrote:
> > On Wed, Nov 18, 2020 at 8:58 AM Nicolas Dichtel
> >  wrote:
> >>
> >> On 18/11/2020 at 02:12, David Ahern wrote:
> >> [snip]
> >>> If there is no harm in just creating lo in the up state, why not just do
> >>> it vs relying on a sysctl? It only affects 'local' networking so no real
> >>> impact to containers that do not do networking (ie., packets can't
> >>> escape). Linux has a lot of sysctl options; is this one really needed?
> >>>
> > I started with that approach, but then I was informed about these
> > containers that disable networking altogether, including loopback.
> > Also, bringing it up by default would break backward compatibility,
> > hence I resorted to a sysctl.
> >> +1
> >>
> >> And thus, it will benefit to everybody.
> >
> > Well, it benefits everyone who uses networking (most of us) inside
> Sure.
>
> > netns but would create problems for workloads that create netns to
> > disable networking. One can always disable it after creating the netns
> > but that would mean change in the workflow and it could be viewed as
> > regression.
> The networking is very limited with only a loopback. Do you have some real use
> case in mind?

My use cases all use networking but I think principally we cannot
break backward compatibility, right?
Jakub, WDYT?


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-18 Thread महेश बंडेवार
On Wed, Nov 18, 2020 at 10:04 AM David Ahern  wrote:
>
> On 11/18/20 10:39 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
> > On Wed, Nov 18, 2020 at 8:58 AM Nicolas Dichtel
> >  wrote:
> >>
> >> On 18/11/2020 at 02:12, David Ahern wrote:
> >> [snip]
> >>> If there is no harm in just creating lo in the up state, why not just do
> >>> it vs relying on a sysctl? It only affects 'local' networking so no real
> >>> impact to containers that do not do networking (ie., packets can't
> >>> escape). Linux has a lot of sysctl options; is this one really needed?
> >>>
> > I started with that approach, but then I was informed about these
> > containers that disable networking altogether, including loopback.
> > Also, bringing it up by default would break backward compatibility,
> > hence I resorted to a sysctl.
> >> +1
> >>
> >> And thus, it will benefit to everybody.
> >
> > Well, it benefits everyone who uses networking (most of us) inside
> > netns but would create problems for workloads that create netns to
> > disable networking. One can always disable it after creating the netns
> > but that would mean change in the workflow and it could be viewed as
> > regression.
> >
>
> Then perhaps the relevant sysctl -- or maybe netns attribute -- is
> whether to create a loopback device at all.

I'm open to ideas within the bounds of maintaining backward
compatibility. However, I'm not able to see how we can pull it off
without creating a 'loopback' device.


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-18 Thread महेश बंडेवार
On Wed, Nov 18, 2020 at 8:58 AM Nicolas Dichtel
 wrote:
>
> On 18/11/2020 at 02:12, David Ahern wrote:
> [snip]
> > If there is no harm in just creating lo in the up state, why not just do
> > it vs relying on a sysctl? It only affects 'local' networking so no real
> > impact to containers that do not do networking (ie., packets can't
> > escape). Linux has a lot of sysctl options; is this one really needed?
> >
I started with that approach, but then I was informed about these
containers that disable networking altogether, including loopback.
Also, bringing it up by default would break backward compatibility,
hence I resorted to a sysctl.
> +1
>
> And thus, it will benefit to everybody.

Well, it benefits everyone who uses networking (most of us) inside
netns but would create problems for workloads that create netns to
disable networking. One can always disable it after creating the netns
but that would mean change in the workflow and it could be viewed as
regression.


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-17 Thread महेश बंडेवार
On Tue, Nov 17, 2020 at 9:18 AM Ido Schimmel  wrote:
>
> On Mon, Nov 16, 2020 at 01:03:32PM -0800, Mahesh Bandewar (महेश बंडेवार) 
> wrote:
> > On Mon, Nov 16, 2020 at 12:34 PM Jakub Kicinski  wrote:
> > >
> > > On Mon, 16 Nov 2020 12:02:48 -0800 Mahesh Bandewar (महेश बंडेवार) wrote:
> > > > > > diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> > > > > > index a1c77cc00416..76dc92ac65a2 100644
> > > > > > --- a/drivers/net/loopback.c
> > > > > > +++ b/drivers/net/loopback.c
> > > > > > @@ -219,6 +219,13 @@ static __net_init int loopback_net_init(struct 
> > > > > > net *net)
> > > > > >
> > > > > >   BUG_ON(dev->ifindex != LOOPBACK_IFINDEX);
> > > > > >   net->loopback_dev = dev;
> > > > > > +
> > > > > > + if (sysctl_netdev_loopback_state) {
> > > > > > + /* Bring loopback device UP */
> > > > > > + rtnl_lock();
> > > > > > + dev_open(dev, NULL);
> > > > > > + rtnl_unlock();
> > > > > > + }
> > > > >
> > > > > The only concern I have here is that it breaks notification ordering.
> > > > > Is there precedent for NETDEV_UP to be generated before all pernet ops
> > > > > ->init was called?
> > > > I'm not sure if there is any, and we didn't see any issues in our usage / tests.
> > > > I'm not even sure anyone is watching/monitoring for lo status as such.
> > >
> > > Ido, David, how does this sound to you?
> > >
> > > I can't think of any particular case where bringing the device up (and
> > > populating it's addresses) before per netns init is finished could be
> > > problematic. But if this is going to make kernel coding harder the
> > > minor convenience of the knob is probably not worth it.
> >
> > +Eric Dumazet
> >
> > I'm not sure why kernel coding should get harder, but happy to listen
> > to the opinions.
>
> Hi,
>
> Sorry for the delay. Does not occur to me as a problematic change. I ran
> various tests with 'sysctl -qw net.core.netdev_loopback_state=1' and a
> debug config. Looks OK.

Thanks for the confirmation, Ido. I think Jian is getting the powerpc
config build fixed to address the build bots' findings, and then he can
push the updated version.


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-16 Thread महेश बंडेवार
On Mon, Nov 16, 2020 at 1:20 PM Jakub Kicinski  wrote:
>
> On Mon, 16 Nov 2020 12:50:22 -0800 Mahesh Bandewar (महेश बंडेवार) wrote:
> > On Mon, Nov 16, 2020 at 12:17 PM Jakub Kicinski  wrote:
> > > On Mon, 16 Nov 2020 12:02:48 -0800 Mahesh Bandewar (महेश बंडेवार) wrote:
> > > > On Sat, Nov 14, 2020 at 10:17 AM Jakub Kicinski  wrote:
> > > > > On Wed, 11 Nov 2020 12:43:08 -0800 Jian Yang wrote:
> > > > > > From: Mahesh Bandewar 
> > > > > >
> > > > > > Traditionally loopback devices come up with initial state as DOWN for
> > > > > > any new network-namespace. This would mean that anyone needing this
> > > > > > device (which is mostly true except sandboxes where networking is not
> > > > > > needed at all) would have to bring this UP by issuing something like
> > > > > > 'ip link set lo up', which can be avoided if the initial state can be
> > > > > > set as UP. Also ICMP error propagation needs loopback to be UP.
> > > > > >
> > > > > > The default value for this sysctl is set to ZERO, which will preserve
> > > > > > the backward-compatible behavior for the root-netns, while changing the
> > > > > > sysctl will only alter the behavior of newer network namespaces.
> > > >
> > > > > Any reason why the new sysctl itself is not per netns?
> > > > >
> > > > Making it per netns would not be very useful since its effect is only
> > > > during netns creation.
> > >
> > > I must be confused. Are all namespaces spawned off init_net, not the
> > > current netns the process is in?
> > The namespace hierarchy is maintained in user-ns, while per-ns sysctls
> > hang off a netns object, and we don't have a parent (netns) reference
> > when initializing a newly created netns's values. One could copy the
> > current value of the setting from the root-ns, but I don't think that's
> > a good practice since there is no clear way to propagate changes when
> > the root-ns later modifies it. Also, from an isolation perspective, (I
> > think) it's better to have this behavior (the sysctl values) stand on
> > its own, i.e., start from a default and then be altered independently,
> > without linking to any other netns's values.
>
> To be clear, what I meant was just to make the sysctl per namespace.
> That way you can deploy a workload with this sysctl set appropriately
> without changing the system-global setting.
>
> Is your expectation that particular application stacks would take
> advantage of this, or that people would set this to 1 across the
> fleet?

Having loopback DOWN is useful only to a tiny fraction of netns use
cases where networking is not enabled, but since that's how it has
historically been, breaking that (default) behavior is not right. Apart
from those cases, wherever networking is used inside a netns (most of
the use cases), loopback is always required to be UP, otherwise ICMP
error delivery is affected. So workloads that always use networking
inside netns would always set this value to 1 (Google's use case),
while workloads with a mix of non-networking and networking-enabled
netns-es can use the default behavior that has traditionally been
present (value set to 0).


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-16 Thread महेश बंडेवार
On Mon, Nov 16, 2020 at 12:34 PM Jakub Kicinski  wrote:
>
> On Mon, 16 Nov 2020 12:02:48 -0800 Mahesh Bandewar (महेश बंडेवार) wrote:
> > > > diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> > > > index a1c77cc00416..76dc92ac65a2 100644
> > > > --- a/drivers/net/loopback.c
> > > > +++ b/drivers/net/loopback.c
> > > > @@ -219,6 +219,13 @@ static __net_init int loopback_net_init(struct net 
> > > > *net)
> > > >
> > > >   BUG_ON(dev->ifindex != LOOPBACK_IFINDEX);
> > > >   net->loopback_dev = dev;
> > > > +
> > > > + if (sysctl_netdev_loopback_state) {
> > > > + /* Bring loopback device UP */
> > > > + rtnl_lock();
> > > > + dev_open(dev, NULL);
> > > > + rtnl_unlock();
> > > > + }
> > >
> > > The only concern I have here is that it breaks notification ordering.
> > > Is there precedent for NETDEV_UP to be generated before all pernet ops
> > > ->init was called?
> > I'm not sure if there is any, and we didn't see any issues in our usage / tests.
> > I'm not even sure anyone is watching/monitoring for lo status as such.
>
> Ido, David, how does this sound to you?
>
> I can't think of any particular case where bringing the device up (and
> populating it's addresses) before per netns init is finished could be
> problematic. But if this is going to make kernel coding harder the
> minor convenience of the knob is probably not worth it.

+Eric Dumazet

I'm not sure why kernel coding should get harder, but happy to listen
to the opinions.


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-16 Thread महेश बंडेवार
On Mon, Nov 16, 2020 at 12:17 PM Jakub Kicinski  wrote:
>
> On Mon, 16 Nov 2020 12:02:48 -0800 Mahesh Bandewar (महेश बंडेवार) wrote:
> > On Sat, Nov 14, 2020 at 10:17 AM Jakub Kicinski  wrote:
> > > On Wed, 11 Nov 2020 12:43:08 -0800 Jian Yang wrote:
> > > > From: Mahesh Bandewar 
> > > >
> > > > Traditionally loopback devices come up with initial state as DOWN for
> > > > any new network-namespace. This would mean that anyone needing this
> > > > device (which is mostly true except sandboxes where networking is not
> > > > needed at all), would have to bring this UP by issuing something like
> > > > 'ip link set lo up' which can be avoided if the initial state can be set
> > > > as UP. Also ICMP error propagation needs loopback to be UP.
> > > >
> > > > The default value for this sysctl is set to ZERO which will preserve the
> > > > backward compatible behavior for the root-netns while changing the
> > > > sysctl will only alter the behavior of the newer network namespaces.
> >
> > > Any reason why the new sysctl itself is not per netns?
> > >
> > Making it per netns would not be very useful since its effect is only
> > during netns creation.
>
> I must be confused. Are all namespaces spawned off init_net, not the
> current netns the process is in?
The namespace hierarchy is maintained in user-ns, while per-ns sysctls
hang off a netns object, and we don't have a parent (netns) reference
when initializing a newly created netns's values. One could copy the
current value of the setting from the root-ns, but I don't think that's
a good practice since there is no clear way to propagate changes when
the root-ns later modifies it. Also, from an isolation perspective, (I
think) it's better to have this behavior (the sysctl values) stand on
its own, i.e., start from a default and then be altered independently,
without linking to any other netns's values.
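
To illustrate the initialization point: a per-netns knob would be set up from
the pernet ->init() callback, which only receives the new netns, so the value
has to come from a fixed default rather than from the creating namespace. A
rough sketch (the field and function names below are hypothetical):

static int __net_init lo_state_net_init(struct net *net)
{
	/* Hypothetical per-netns field; note there is no "parent netns"
	 * argument here, so there is nothing to inherit from, only a
	 * compile-time default. */
	net->core.sysctl_lo_initial_state = 0;
	return 0;
}

static struct pernet_operations lo_state_net_ops = {
	.init = lo_state_net_init,
};

/* registered early via register_pernet_subsys(&lo_state_net_ops) */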


Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-16 Thread महेश बंडेवार
On Sat, Nov 14, 2020 at 10:17 AM Jakub Kicinski  wrote:
>
> On Wed, 11 Nov 2020 12:43:08 -0800 Jian Yang wrote:
> > From: Mahesh Bandewar 
> >
> > Traditionally loopback devices come up with initial state as DOWN for
> > any new network-namespace. This would mean that anyone needing this
> > device (which is mostly true except sandboxes where networking is not
> > needed at all), would have to bring this UP by issuing something like
> > 'ip link set lo up' which can be avoided if the initial state can be set
> > as UP. Also ICMP error propagation needs loopback to be UP.
> >
> > The default value for this sysctl is set to ZERO which will preserve the
> > backward compatible behavior for the root-netns while changing the
> > sysctl will only alter the behavior of the newer network namespaces.
>
> Any reason why the new sysctl itself is not per netns?
>
Making it per netns would not be very useful since its effect is only
during netns creation.

> > +netdev_loopback_state
> > +-
>
> loopback_init_state ?
>
That's fine, thanks for the suggestion.

> > +Controls the loopback device initial state for any new network namespaces. 
> > By
> > +default, we keep the initial state as DOWN.
> > +
> > +If set to 1, the loopback device will be brought UP during namespace 
> > creation.
> > +This applies only to newly created network namespaces.
> > +
> > +Default : 0  (for compatibility reasons)
> > +
> >  netdev_max_backlog
> >  --
> >
> > diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> > index a1c77cc00416..76dc92ac65a2 100644
> > --- a/drivers/net/loopback.c
> > +++ b/drivers/net/loopback.c
> > @@ -219,6 +219,13 @@ static __net_init int loopback_net_init(struct net 
> > *net)
> >
> >   BUG_ON(dev->ifindex != LOOPBACK_IFINDEX);
> >   net->loopback_dev = dev;
> > +
> > + if (sysctl_netdev_loopback_state) {
> > + /* Bring loopback device UP */
> > + rtnl_lock();
> > + dev_open(dev, NULL);
> > + rtnl_unlock();
> > + }
>
> The only concern I have here is that it breaks notification ordering.
> Is there precedent for NETDEV_UP to be generated before all pernet ops
> ->init was called?
I'm not sure if there is any, and we didn't see any issues in our usage / tests.
I'm not even sure anyone is watching/monitoring for lo status as such.
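
FWIW, anything that did want to watch lo state transitions would just be a
regular netdevice notifier; a minimal sketch of such an observer (the names
here are hypothetical):

static int lo_watch_event(struct notifier_block *nb,
			  unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	/* Fires for every device in every netns; filter on loopback */
	if (event == NETDEV_UP && dev->ifindex == LOOPBACK_IFINDEX)
		pr_info("%s came up in its netns\n", dev->name);

	return NOTIFY_DONE;
}

static struct notifier_block lo_watch_nb = {
	.notifier_call = lo_watch_event,
};

/* register_netdevice_notifier(&lo_watch_nb) from module init */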


Re: [PATCHv2 next] net: add option to not create fall-back tunnels in root-ns as well

2020-08-25 Thread महेश बंडेवार
On Tue, Aug 25, 2020 at 5:42 PM Maciej Żenczykowski  wrote:
>
> On Tue, Aug 25, 2020 at 4:00 PM Mahesh Bandewar (महेश बंडेवार)
>  wrote:
> >
> > On Tue, Aug 25, 2020 at 3:47 PM Randy Dunlap  wrote:
> > >
> > > On 8/25/20 3:42 PM, Mahesh Bandewar wrote:
> > > > The sysctl added earlier by commit 79134e6ce2c ("net: do not create
> > > > fallback tunnels for non-default namespaces") creates fall-back
> > > > tunnels only in the root-ns. This patch enhances that behavior to
> > > > provide an option not to create fallback tunnels in the root-ns as
> > > > well. Since modules that create fallback tunnels could be built-in
> > > > and setting the sysctl value after booting is pointless, a kernel
> > > > cmdline option was added to change this default. The default setting
> > > > is preserved for backward compatibility. The kernel command line
> > > > option fb_tunnels=initns will set the sysctl value to 1 and create
> > > > fallback tunnels only in initns, while fb_tunnels=none will set the
> > > > sysctl value to 2 and fallback tunnels will be skipped in every netns.
> > > >
> > > > Signed-off-by: Mahesh Bandewar 
> > > > Cc: Eric Dumazet 
> > > > Cc: Maciej Zenczykowski 
> > > > Cc: Jian Yang 
> > > > ---
> > > > v1->v2
> > > >   Removed the Kconfig option which would force rebuild and replaced with
> > > >   kcmd-line option
> > > >
> > > >  .../admin-guide/kernel-parameters.txt |  5 +
> > > >  Documentation/admin-guide/sysctl/net.rst  | 20 +--
> > > >  include/linux/netdevice.h |  7 ++-
> > > >  net/core/sysctl_net_core.c| 17 ++--
> > > >  4 files changed, 40 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > > > b/Documentation/admin-guide/kernel-parameters.txt
> > > > index a1068742a6df..09a51598c792 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -801,6 +801,11 @@
> > > >
> > > >   debug_objects   [KNL] Enable object debugging
> > > >
> > > > + fb_tunnels= [NET]
> > > > + Format: { initns | none }
> > > > + See Documentation/admin-guide/sysctl/net.rst for
> > > > + fb_tunnels_only_for_init_ns
> > > > +
> > >
> > > Not at this location in this file.
> > > Entries in this file are meant to be in alphabetical order (mostly).
> > >
> > > So leave debug_objects and no_debug_objects together, and insert 
> > > fb_tunnels
> > > between fail_make_request= and floppy=.
> > >
> > I see. I'll fix it in the next revision.
> > thanks for the suggestion.
> > --mahesh..
> >
> > > Thanks.
> > >
> > > >   no_debug_objects
> > > >   [KNL] Disable object debugging
> > > >
> > >
> > > --
> > > ~Randy
>
> Setting it to 1 via kcmdline doesn't seem all that useful, since
> instead of that you can just use initrc/sysctl.conf/etc.
>
> Would it be simpler if it was just 'no_fb_tunnels' or
> 'no_fallback_tunnels' and the function just set the sysctl to 2
> unconditionally?
> (ie. no = parsing at all) that would also be less code...
>
To keep it simple, all methods should be able to set all values. Otherwise I
agree that it makes less sense to set value = 1 via the kernel command line.
Also, one can assign value = 2 to the sysctl once the kernel is booted; it may
not always produce the desired results, but it would work if you load the
modules after changing the sysctl value. I guess the idea here is to give the
user full control over their particular situation and let them choose the
correct method for the desired end result.
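
For illustration, the kcmdline handler would be tiny; a rough sketch, assuming
the existing sysctl_fb_tunnels_only_for_init_net knob is the one being
extended (the exact parsing in the posted patch may differ):

static int __init fb_tunnels_cmdline(char *str)
{
	/* fb_tunnels=initns -> fallback tunnels only in the init netns (1)
	 * fb_tunnels=none   -> no fallback tunnels in any netns (2)
	 * no option         -> keep today's default behavior (0) */
	if (str && !strcmp(str, "initns"))
		sysctl_fb_tunnels_only_for_init_net = 1;
	else if (str && !strcmp(str, "none"))
		sysctl_fb_tunnels_only_for_init_net = 2;
	return 1;
}
__setup("fb_tunnels=", fb_tunnels_cmdline);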

> btw. I also don't understand the '!IS_ENABLED(CONFIG_SYSCTL) ||'
> piece.  Why not just delete that?
> This seems to force fallback tunnels if you disable CONFIG_SYSCTL...
> but (a) then the kcmdline option doesn't work,
> and (b) that should already just happen by virtue of the sysctl defaulting to 
> 0.
agreed. will remove it (!IS_ENABLED(CONFIG_SYSCTL) check) in the next revision.


Re: [PATCHv2 next] net: add option to not create fall-back tunnels in root-ns as well

2020-08-25 Thread महेश बंडेवार
On Tue, Aug 25, 2020 at 3:47 PM Randy Dunlap  wrote:
>
> On 8/25/20 3:42 PM, Mahesh Bandewar wrote:
> > The sysctl added earlier by commit 79134e6ce2c ("net: do not create
> > fallback tunnels for non-default namespaces") creates fall-back tunnels
> > only in the root-ns. This patch enhances that behavior to provide an
> > option not to create fallback tunnels in the root-ns as well. Since
> > modules that create fallback tunnels could be built-in and setting the
> > sysctl value after booting is pointless, a kernel cmdline option was
> > added to change this default. The default setting is preserved for
> > backward compatibility. The kernel command line option fb_tunnels=initns
> > will set the sysctl value to 1 and create fallback tunnels only in
> > initns, while fb_tunnels=none will set the sysctl value to 2 and
> > fallback tunnels will be skipped in every netns.
> >
> > Signed-off-by: Mahesh Bandewar 
> > Cc: Eric Dumazet 
> > Cc: Maciej Zenczykowski 
> > Cc: Jian Yang 
> > ---
> > v1->v2
> >   Removed the Kconfig option which would force rebuild and replaced with
> >   kcmd-line option
> >
> >  .../admin-guide/kernel-parameters.txt |  5 +
> >  Documentation/admin-guide/sysctl/net.rst  | 20 +--
> >  include/linux/netdevice.h |  7 ++-
> >  net/core/sysctl_net_core.c| 17 ++--
> >  4 files changed, 40 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index a1068742a6df..09a51598c792 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -801,6 +801,11 @@
> >
> >   debug_objects   [KNL] Enable object debugging
> >
> > + fb_tunnels= [NET]
> > + Format: { initns | none }
> > + See Documentation/admin-guide/sysctl/net.rst for
> > + fb_tunnels_only_for_init_ns
> > +
>
> Not at this location in this file.
> Entries in this file are meant to be in alphabetical order (mostly).
>
> So leave debug_objects and no_debug_objects together, and insert fb_tunnels
> between fail_make_request= and floppy=.
>
I see. I'll fix it in the next revision.
thanks for the suggestion.
--mahesh..

> Thanks.
>
> >   no_debug_objects
> >   [KNL] Disable object debugging
> >
>
> --
> ~Randy
>


Re: [PATCH next] net: add option to not create fall-back tunnels in root-ns as well

2020-08-21 Thread महेश बंडेवार
On Fri, Aug 21, 2020 at 2:35 PM David Miller  wrote:
>
> From: Maciej Żenczykowski 
> Date: Fri, 21 Aug 2020 14:25:20 -0700
>
> > If no kernel command line option is specified, should the default
> > be to maintain compatibility, or do you think it's okay to make
> > the default be no extra interfaces?  They can AFAICT always be added
> > manually via 'ip link add' netlink commands.
>
> You can't change current default behavior, so the answer should be
> obvious right?
Yes, I don't want to change the default behavior; that's why I kept the
default value = 0. But yes, this config option would force one to
rebuild the kernel.

OK, I'll respin it with a kcmdline option instead of a config option,
maintaining the default behavior. Thanks.


Re: [PATCHv3 next] blackhole_netdev: fix syzkaller reported issue

2019-10-16 Thread महेश बंडेवार
On Tue, Oct 15, 2019 at 10:36 AM David Miller  wrote:
>
> From: Mahesh Bandewar 
> Date: Fri, 11 Oct 2019 18:14:55 -0700
>
> > While invalidating the dst, we assign blackhole_netdev instead of
> > loopback device. However, this device does not have idev pointer
> > and hence no ip6_ptr even if IPv6 is enabled. Possibly this has
> > triggered the syzbot reported crash.
> >
> > The syzbot report does not have reproducer, however, this is the
> > only device that doesn't have matching idev created.
> >
> > Crash instruction is :
> >
> > static inline bool ip6_ignore_linkdown(const struct net_device *dev)
> > {
> > const struct inet6_dev *idev = __in6_dev_get(dev);
> >
> > return !!idev->cnf.ignore_routes_with_linkdown; <= crash
> > }
> >
> > Also ipv6 always assumes presence of idev and never checks for it
> > being NULL (as does the above-referenced code). So adding an idev
> > for the blackhole_netdev to avoid this class of crashes in the future.
> >
> > ---
>  ...
> > Fixes: 8d7017fd621d ("blackhole_netdev: use blackhole_netdev to invalidate 
> > dst entries")
> > Signed-off-by: Mahesh Bandewar 
>
> Applied and queued up for -stable, thanks.

Hi David,
I just started seeing pretty weird behavior in the IPv6 stack after this
patch was applied to the net/stable branch. I suspect it's my patch,
especially after the v2/v3 changes, as I didn't see any other IPv6
changes in the net branch. I propose to revert this and take a closer
look so that it doesn't spill any further. I'll post a revert soon.

thanks,
--mahesh..
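
For completeness, the defensive alternative mentioned in the quoted commit
message (checking idev for NULL at the crash site rather than guaranteeing an
idev exists) would look roughly like this; the patch deliberately goes the
other way and registers an inet6_dev for blackhole_netdev instead:

static inline bool ip6_ignore_linkdown(const struct net_device *dev)
{
	const struct inet6_dev *idev = __in6_dev_get(dev);

	/* blackhole_netdev carries no inet6_dev, so guard the dereference */
	if (unlikely(!idev))
		return false;

	return !!idev->cnf.ignore_routes_with_linkdown;
}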


Re: [PATCHv2 next] blackhole_netdev: fix syzkaller reported issue

2019-10-11 Thread महेश बंडेवार
On Thu, Oct 10, 2019 at 5:37 PM Wei Wang  wrote:
>
> On Thu, Oct 10, 2019 at 9:48 AM Mahesh Bandewar  wrote:
> >
> > While invalidating the dst, we assign blackhole_netdev instead of
> > loopback device. However, this device does not have idev pointer
> > and hence no ip6_ptr even if IPv6 is enabled. Possibly this has
> > triggered the syzbot reported crash.
> >
> > The syzbot report does not have reproducer, however, this is the
> > only device that doesn't have matching idev created.
> >
> > Crash instruction is :
> >
> > static inline bool ip6_ignore_linkdown(const struct net_device *dev)
> > {
> > const struct inet6_dev *idev = __in6_dev_get(dev);
> >
> > return !!idev->cnf.ignore_routes_with_linkdown; <= crash
> > }
> >
> > Also ipv6 always assumes presence of idev and never checks for it
> > being NULL (as does the above-referenced code). So adding an idev
> > for the blackhole_netdev to avoid this class of crashes in the future.
> >
> > ---
> >
> > syzbot has found the following crash on:
> >
> > HEAD commit:125b7e09 net: tc35815: Explicitly check NET_IP_ALIGN is no..
> > git tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 
> > master
> > console output: 
> > https://syzkaller-buganizer.googleplex.com/text?tag=CrashLog&id=96236769ac48d9c2eb3a0db5385370ea6940be83
> > kernel config:  
> > https://syzkaller-buganizer.googleplex.com/text?tag=Config&id=1ed331b637302a68174fdbe34315b781b7c7ab1e
> > dashboard link: https://syzkaller.appspot.com/bug?extid=86298614df433d7f3824
> > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> >
> > See http://go/syzbot for details on how to handle this bug.
> >
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault:  [#1] PREEMPT SMP KASAN
> > CPU: 0 PID: 16525 Comm: syz-executor.2 Not tainted 5.3.0-rc3+ #159
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > RIP: 0010:ip6_ignore_linkdown include/net/addrconf.h:402 [inline]
> > RIP: 0010:find_match+0x132/0xd70 net/ipv6/route.c:743
> > Code: 0f b6 75 b0 40 84 f6 0f 84 ad 01 00 00 e8 c6 4c 34 fb 49 8d bf 3c 02 
> > 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 
> > 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 15
> > RSP: 0018:88806c037038 EFLAGS: 00010207
> > RAX: dc00 RBX: 888094c5cd60 RCX: c9000b791000
> > RDX: 0047 RSI: 863e3caa RDI: 023c
> > RBP: 88806c037098 R08: 888063de0600 R09: 88806c037288
> > R10: fbfff134af07 R11: 89a5783f R12: 0003
> > R13: 88806c037298 R14: 888094c5cd6f R15: 
> > FS:  7fa15e7de700() GS:8880ae80() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 00625208 CR3: 5e348000 CR4: 001406f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> > __find_rr_leaf+0x14e/0x750 net/ipv6/route.c:831
> > find_rr_leaf net/ipv6/route.c:852 [inline]
> > rt6_select net/ipv6/route.c:896 [inline]
> > fib6_table_lookup+0x697/0xdb0 net/ipv6/route.c:2164
> > ip6_pol_route+0x1f6/0xaf0 net/ipv6/route.c:2200
> > ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2452
> > fib6_rule_lookup+0x133/0x7a0 net/ipv6/fib6_rules.c:113
> > ip6_route_output_flags_noref+0x2d6/0x360 net/ipv6/route.c:2484
> > ip6_route_output_flags+0x106/0x4d0 net/ipv6/route.c:2497
> > ip6_route_output include/net/ip6_route.h:98 [inline]
> > ip6_dst_lookup_tail+0x1042/0x1ef0 net/ipv6/ip6_output.c:1022
> > ip6_dst_lookup_flow+0xa8/0x220 net/ipv6/ip6_output.c:1150
> > ip6_sk_dst_lookup_flow net/ipv6/ip6_output.c:1188 [inline]
> > ip6_sk_dst_lookup_flow+0x62a/0xb90 net/ipv6/ip6_output.c:1178
> > udpv6_sendmsg+0x17ba/0x2990 net/ipv6/udp.c:1439
> > inet6_sendmsg+0x9e/0xe0 net/ipv6/af_inet6.c:576
> > sock_sendmsg_nosec net/socket.c:637 [inline]
> > sock_sendmsg+0xd7/0x130 net/socket.c:657
> > __sys_sendto+0x262/0x380 net/socket.c:1952
> > __do_sys_sendto net/socket.c:1964 [inline]
> > __se_sys_sendto net/socket.c:1960 [inline]
> > __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1960
> > do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x459829
> > Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 
> > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff 
> > ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> > RSP: 002b:7fa15e7ddc78 EFLAGS: 0246 ORIG_RAX: 002c
> > RAX: ffda RBX: 0006 RCX: 00459829
> > RDX: 0001 RSI: 2000 RDI: 0007
> > RBP: 0075c1c0 R08: 20001000 R09: 001c
> > R10:  R11: 00

Re: [PATCH next] blackhole_netdev: fix syzkaller reported issue

2019-10-09 Thread महेश बंडेवार
On Wed, Oct 9, 2019 at 4:15 PM Mahesh Bandewar  wrote:
>
> While invalidating the dst, we assign blackhole_netdev instead of
> loopback device. However, this device does not have idev pointer
> and hence no ip6_ptr even if IPv6 is enabled. Possibly this has
> triggered the syzbot reported crash.
>
> The syzbot report does not have reproducer, however, this is the
> only device that doesn't have matching idev created.
>
> Crash instruction is :
>
> static inline bool ip6_ignore_linkdown(const struct net_device *dev)
> {
> const struct inet6_dev *idev = __in6_dev_get(dev);
>
> return !!idev->cnf.ignore_routes_with_linkdown; <= crash
> }
>
> Also ipv6 always assumes presence of idev and never checks for it
> being NULL (as does the above-referenced code). So adding an idev
> for the blackhole_netdev to avoid this class of crashes in the future.
>
> ---
>
> syzbot has found the following crash on:
>
> HEAD commit:125b7e09 net: tc35815: Explicitly check NET_IP_ALIGN is no..
> git tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 
> master
> console output: 
> https://syzkaller-buganizer.googleplex.com/text?tag=CrashLog&id=96236769ac48d9c2eb3a0db5385370ea6940be83
> kernel config:  
> https://syzkaller-buganizer.googleplex.com/text?tag=Config&id=1ed331b637302a68174fdbe34315b781b7c7ab1e
> dashboard link: https://syzkaller.appspot.com/bug?extid=86298614df433d7f3824
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> See http://go/syzbot for details on how to handle this bug.
>
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 16525 Comm: syz-executor.2 Not tainted 5.3.0-rc3+ #159
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:ip6_ignore_linkdown include/net/addrconf.h:402 [inline]
> RIP: 0010:find_match+0x132/0xd70 net/ipv6/route.c:743
> Code: 0f b6 75 b0 40 84 f6 0f 84 ad 01 00 00 e8 c6 4c 34 fb 49 8d bf 3c 02 00 
> 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 
> 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 15
> RSP: 0018:88806c037038 EFLAGS: 00010207
> RAX: dc00 RBX: 888094c5cd60 RCX: c9000b791000
> RDX: 0047 RSI: 863e3caa RDI: 023c
> RBP: 88806c037098 R08: 888063de0600 R09: 88806c037288
> R10: fbfff134af07 R11: 89a5783f R12: 0003
> R13: 88806c037298 R14: 888094c5cd6f R15: 
> FS:  7fa15e7de700() GS:8880ae80() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 00625208 CR3: 5e348000 CR4: 001406f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
> __find_rr_leaf+0x14e/0x750 net/ipv6/route.c:831
> find_rr_leaf net/ipv6/route.c:852 [inline]
> rt6_select net/ipv6/route.c:896 [inline]
> fib6_table_lookup+0x697/0xdb0 net/ipv6/route.c:2164
> ip6_pol_route+0x1f6/0xaf0 net/ipv6/route.c:2200
> ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2452
> fib6_rule_lookup+0x133/0x7a0 net/ipv6/fib6_rules.c:113
> ip6_route_output_flags_noref+0x2d6/0x360 net/ipv6/route.c:2484
> ip6_route_output_flags+0x106/0x4d0 net/ipv6/route.c:2497
> ip6_route_output include/net/ip6_route.h:98 [inline]
> ip6_dst_lookup_tail+0x1042/0x1ef0 net/ipv6/ip6_output.c:1022
> ip6_dst_lookup_flow+0xa8/0x220 net/ipv6/ip6_output.c:1150
> ip6_sk_dst_lookup_flow net/ipv6/ip6_output.c:1188 [inline]
> ip6_sk_dst_lookup_flow+0x62a/0xb90 net/ipv6/ip6_output.c:1178
> udpv6_sendmsg+0x17ba/0x2990 net/ipv6/udp.c:1439
> inet6_sendmsg+0x9e/0xe0 net/ipv6/af_inet6.c:576
> sock_sendmsg_nosec net/socket.c:637 [inline]
> sock_sendmsg+0xd7/0x130 net/socket.c:657
> __sys_sendto+0x262/0x380 net/socket.c:1952
> __do_sys_sendto net/socket.c:1964 [inline]
> __se_sys_sendto net/socket.c:1960 [inline]
> __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1960
> do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x459829
> Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 
> 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7fa15e7ddc78 EFLAGS: 0246 ORIG_RAX: 002c
> RAX: ffda RBX: 0006 RCX: 00459829
> RDX: 0001 RSI: 2000 RDI: 0007
> RBP: 0075c1c0 R08: 20001000 R09: 001c
> R10:  R11: 0246 R12: 7fa15e7de6d4
> R13: 004c77e7 R14: 004dd048 R15: 
> Modules linked in:
> ---[ end trace 4b3ce5eddd15c8f6 ]---
> RIP: 0010:ip6_ignore_linkdown include/net/addrconf.h:402 [inline]
> RIP: 0010

Re: [PATCH iproute2 2/2] ip tunnel: warn when changing IPv6 tunnel without tunnel name

2019-07-09 Thread महेश बंडेवार
On Tue, Jul 9, 2019 at 6:16 AM Andrea Claudi  wrote:
>
> Tunnel change fails if a tunnel name is not specified while using
> 'ip -6 tunnel change'. However, no warning message is printed and
> no error code is returned.
>
> $ ip -6 tunnel add ip6tnl1 mode ip6gre local fd::1 remote fd::2 tos inherit 
> ttl 127 encaplimit none dev dummy0
> $ ip -6 tunnel change dev dummy0 local 2001:1234::1 remote 2001:1234::2
> $ ip -6 tunnel show ip6tnl1
> ip6tnl1: gre/ipv6 remote fd::2 local fd::1 dev dummy0 encaplimit none 
> hoplimit 127 tclass inherit flowlabel 0x0 (flowinfo 0x)
>
> This commit checks if tunnel interface name is equal to an empty
> string: in this case, it prints a warning message to the user.
> It intentionally avoids to return an error to not break existing
> script setup.
>
> This is the output after this commit:
> $ ip -6 tunnel add ip6tnl1 mode ip6gre local fd::1 remote fd::2 tos inherit 
> ttl 127 encaplimit none dev dummy0
> $ ip -6 tunnel change dev dummy0 local 2001:1234::1 remote 2001:1234::2
> Tunnel interface name not specified
> $ ip -6 tunnel show ip6tnl1
> ip6tnl1: gre/ipv6 remote fd::2 local fd::1 dev dummy0 encaplimit none 
> hoplimit 127 tclass inherit flowlabel 0x0 (flowinfo 0x)
>
> Reviewed-by: Matteo Croce 
> Signed-off-by: Andrea Claudi 

I tried your patch and the commands that I posted in my (previous) patch.

Here is the output after reverting my patch and applying your patch



vm0:/tmp# ./ip -6 tunnel add ip6tnl1 mode ip6gre local fd::1 remote
fd::2 tos inherit ttl 127 encaplimit none
vm0:/tmp# ./ip -6 tunnel show dev ip6tnl1
vm0:/tmp# echo $?
0

Here the output is empty and the return code is 0. This is wrong; I
would expect to see the tunnel info (as displayed by 'ip -6 tunnel
show ip6tnl1').


lpaa10:/tmp# ip -6 tunnel change dev ip6tnl1 local 2001:1234::1 remote
2001:1234::2 encaplimit none ttl 127 tos inherit allow-localremote
lpaa10:/tmp# echo $?
0
lpaa10:/tmp# ip -6 tunnel show dev ip6tnl1
lpaa10:/tmp# ip -6 tunnel show ip6tnl1
ip6tnl1: gre/ipv6 remote fd::2 local fd::1 encaplimit none hoplimit
127 tclass inherit flowlabel 0x0 (flowinfo 0x)

The change command appeared to be successful, but the change wasn't
applied (I expected allow-localremote to be present on the tunnel).
> ---
>  ip/ip6tunnel.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
> index 999408ed801b1..e3da11eb4518e 100644
> --- a/ip/ip6tunnel.c
> +++ b/ip/ip6tunnel.c
> @@ -386,6 +386,9 @@ static int do_add(int cmd, int argc, char **argv)
> if (parse_args(argc, argv, cmd, &p) < 0)
> return -1;
>
> +   if (!*p.name)
> +   fprintf(stderr, "Tunnel interface name not specified\n");
> +
> if (p.proto == IPPROTO_GRE)
> basedev = "ip6gre0";
> else if (p.i_flags & VTI_ISVTI)
> --
> 2.20.1
>


Re: [PATCH iproute2 1/2] Revert "ip6tunnel: fix 'ip -6 {show|change} dev ' cmds"

2019-07-09 Thread महेश बंडेवार
On Tue, Jul 9, 2019 at 6:16 AM Andrea Claudi  wrote:
>
> This reverts commit ba126dcad20e6d0e472586541d78bdd1ac4f1123.
> It breaks tunnel creation when using 'dev' parameter:
>
> $ ip link add type dummy
> $ ip -6 tunnel add ip6tnl1 mode ip6ip6 remote 2001:db8::100::2 local 
> 2001:db8::100::1 hoplimit 1 tclass 0x0 dev dummy0
> add tunnel "ip6tnl0" failed: File exists
>
> dev parameter must be used to specify the device to which
> the tunnel is binded, and not the tunnel itself.
>
> Reported-by: Jianwen Ji 
> Reviewed-by: Matteo Croce 
> Signed-off-by: Andrea Claudi 
> ---
>  ip/ip6tunnel.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
> index 56fd3466ed062..999408ed801b1 100644
> --- a/ip/ip6tunnel.c
> +++ b/ip/ip6tunnel.c
> @@ -298,8 +298,6 @@ static int parse_args(int argc, char **argv, int cmd, 
> struct ip6_tnl_parm2 *p)
> p->link = ll_name_to_index(medium);
> if (!p->link)
> return nodev(medium);
> -   else
> -   strlcpy(p->name, medium, sizeof(p->name));
NACK

I see that ba126dcad20e6d0e472586541d78bdd1ac4f1123 broke something,
but that doesn't mean reverting the original fix is the correct way of
fixing it. The original fix addresses the show and change commands.
Shouldn't you try fixing the add command so that all three (show,
change, and add) work correctly?

> }
> return 0;
>  }
> --
> 2.20.1
>
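
To make the suggestion concrete, a hedged sketch of what I mean in
parse_args(); this assumes cmd carries the SIOC*TUNNEL request, and the real
fix may well end up looking different:

	p->link = ll_name_to_index(medium);
	if (!p->link)
		return nodev(medium);
	/* Use "dev" as the tunnel selector only when no explicit tunnel
	 * name was given, and never for "add", so that "add ... dev ethX"
	 * keeps ethX purely as the underlying device and p->name stays
	 * whatever the user named the tunnel. */
	if (!*p->name && cmd != SIOCADDTUNNEL)
		strlcpy(p->name, medium, sizeof(p->name));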


Re: [PATCHv2 next 3/3] blackhole_dev: add a selftest

2019-07-01 Thread महेश बंडेवार
On Sat, Jun 29, 2019 at 12:28 PM David Miller  wrote:
>
> From: Mahesh Bandewar 
> Date: Thu, 27 Jun 2019 12:43:09 -0700
>
> > +config TEST_BLACKHOLE_DEV
> > + tristate "Test BPF filter functionality"
>
> I think the tristate string needs to be changed :-)
side effects of copy-paste :(
sending v3


Re: [PATCHv2 next 0/3] blackhole device to invalidate dst

2019-06-28 Thread महेश बंडेवार
On Fri, Jun 28, 2019 at 11:22 AM Michael Chan  wrote:
>
> On Thu, Jun 27, 2019 at 12:42 PM Mahesh Bandewar  wrote:
>
> > However, Michael Chan  had a setup
> > where these fixes helped him mitigate the issue and not cause
> > the crash.
> >
>
> Our lab has finished testing these patches.  The patches work in the
> sense that no oversize packets are now passed to the driver with the
> patches applied.  But I'm not seeing these bad packets reaching the
> blackhole device and getting dropped there.  So they get dropped in
> some other code paths.  I believe we saw the same results with your
> earlier patches.
>
Thanks for the confirmation, Michael. I would say that is WAI (working
as intended). With the MTU that low, I don't think .ndo_xmit for this
device would ever be triggered.


Re: [PATCH next 3/3] blackhole_dev: add a selftest

2019-06-27 Thread महेश बंडेवार
On Thu, Jun 27, 2019 at 11:08 AM David Miller  wrote:
>
> From: Mahesh Bandewar 
> Date: Fri, 21 Jun 2019 17:45:39 -0700
>
> > --- a/tools/testing/selftests/net/Makefile
> > +++ b/tools/testing/selftests/net/Makefile
> > @@ -4,8 +4,9 @@
> >  CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
> >  CFLAGS += -I../../../../usr/include/
> >
> > +<<< HEAD
> >  TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh \
>
> Ummm... yeah... might want to resolve this conflict...
>
oops, my bad! Let me send v2
> :-)


Re: [PATCH next 0/3] blackhole device to invalidate dst

2019-06-24 Thread महेश बंडेवार
On Mon, Jun 24, 2019 at 9:00 PM Michael Chan  wrote:
>
> On Fri, Jun 21, 2019 at 5:45 PM Mahesh Bandewar  wrote:
>
> > Well, I'm not a TCP expert and though we have experienced
> > these corner cases in our environment, I could not reproduce
> > this case reliably in my test setup to try this fix myself.
> > However, Michael Chan  had a setup
> > where these fixes helped him mitigate the issue and not cause
> > the crash.
> >
>
> I will ask the lab to test these patches tomorrow.  Thanks.
Thank you Michael.


Re: [PATCH next 0/3] blackhole device to invalidate dst

2019-06-23 Thread महेश बंडेवार
On Sat, Jun 22, 2019 at 8:35 AM David Ahern  wrote:
>
> On 6/21/19 6:45 PM, Mahesh Bandewar wrote:
> > When we invalidate dst or mark it "dead", we assign 'lo' to
> > dst->dev. First of all this assignment is racy and more over,
> > it has MTU implications.
> >
> > The standard dev MTU is 1500 while the Loopback MTU is 64k. TCP
> > code when dereferencing the dst don't check if the dst is valid
> > or not. TCP when dereferencing a dead-dst while negotiating a
> > new connection, may use dst device which is 'lo' instead of
> > using the correct device. Consider the following scenario:
> >
>
> Why doesn't the TCP code (or any code) check if a cached dst is valid?
> That's the whole point of marking it dead - to tell users not to rely on
> it.

Well, I'm not an expert in TCP, but I can guess. Eric, please keep me
honest on my guess. The code that marks it dead (control path) and the
code that would need to check its validity (data path) would need some
sort of synchronization, and that could be costly in the data path.
Also, what is the worst-case scenario? The packet is going to get
dropped anyway, since the underlying device or route has issues (minus
this corner case). Maybe that was the reason.

As I mentioned in the log, we could remove the raciness with barriers
or locks, but that would be an additional cost in the data path. Having
a dummy/blackhole dev is the cheapest solution with no changes to the
data path.
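
To spell out the core idea in code (this is modeled on how dst_dev_put() in
net/core/dst.c invalidates a dst today, with details trimmed, so treat it as
illustrative rather than the exact patch):

void dst_dev_put(struct dst_entry *dst)
{
	struct net_device *dev = dst->dev;

	dst->obsolete = DST_OBSOLETE_DEAD;
	dst->input = dst_discard;
	dst->output = dst_discard_out;
	/* was: dst->dev = dev_net(dev)->loopback_dev (64k MTU); instead,
	 * point dead dsts at the tiny-MTU, unregistered blackhole device */
	dst->dev = blackhole_netdev;
	dev_hold(dst->dev);
	dev_put(dev);
}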


Re: [PATCH net] ipvlan: disallow userns cap_net_admin to change global mode/flags

2019-02-20 Thread महेश बंडेवार
On Tue, Feb 19, 2019 at 3:38 PM Daniel Borkmann  wrote:
>
> When running Docker with userns isolation e.g. --userns-remap="default"
> and spawning up some containers with CAP_NET_ADMIN under this realm, I
> noticed that link changes on ipvlan slave device inside that container
> can affect all devices from this ipvlan group which are in other net
> namespaces where the container should have no permission to make changes
> to, such as the init netns, for example.
>
> This effectively allows to undo ipvlan private mode and switch globally to
> bridge mode where slaves can communicate directly without going through
> hostns, or it allows to switch between global operation mode (l2/l3/l3s)
> for everyone bound to the given ipvlan master device. libnetwork plugin
> here is creating an ipvlan master and ipvlan slave in hostns and a slave
> each that is moved into the container's netns upon creation event.
>
> * In hostns:
>
>   # ip -d a
>   [...]
>   8: cilium_host@bond0:  mtu 1500 
> qdisc noqueue state UNKNOWN group default qlen 1000
>  link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 
> 68 maxmtu 65535
>  ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 
> gso_max_segs 65535
>  inet 10.41.0.1/32 scope link cilium_host
>valid_lft forever preferred_lft forever
>   [...]
>
> * Spawn container & change ipvlan mode setting inside of it:
>
>   # docker run -dt --cap-add=NET_ADMIN --network cilium-net --name client -l 
> app=test cilium/netperf
>   9fff485d69dcb5ce37c9e33ca20a11ccafc236d690105aadbfb77e4f4170879c
>
>   # docker exec -ti client ip -d a
>   [...]
>   10: cilium0@if4:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default qlen 1000
>   link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 
> 68 maxmtu 65535
>   ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 
> gso_max_segs 65535
>   inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
>  valid_lft forever preferred_lft forever
>
>   # docker exec -ti client ip link change link cilium0 name cilium0 type 
> ipvlan mode l2
>
>   # docker exec -ti client ip -d a
>   [...]
>   10: cilium0@if4:  mtu 1500 qdisc noqueue 
> state UNKNOWN group default qlen 1000
>   link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 
> 68 maxmtu 65535
>   ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 
> gso_max_segs 65535
>   inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
>  valid_lft forever preferred_lft forever
>
> * In hostns (mode switched to l2):
>
>   # ip -d a
>   [...]
>   8: cilium_host@bond0:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default qlen 1000
>   link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 
> 68 maxmtu 65535
>   ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 
> gso_max_segs 65535
>   inet 10.41.0.1/32 scope link cilium_host
>  valid_lft forever preferred_lft forever
>   [...]
>
> Same l3 -> l2 switch would also happen by creating another slave inside
> the container's network namespace when specifying the existing cilium0
> link to derive the actual (bond0) master:
>
>   # docker exec -ti client ip link add link cilium0 name cilium1 type ipvlan 
> mode l2
>
>   # docker exec -ti client ip -d a
>   [...]
>   2: cilium1@if4:  mtu 1500 qdisc noop state DOWN group 
> default qlen 1000
>   link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 
> 68 maxmtu 65535
>   ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 
> gso_max_segs 65535
>   10: cilium0@if4:  mtu 1500 qdisc noqueue 
> state UNKNOWN group default qlen 1000
>   link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 
> 68 maxmtu 65535
>   ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 
> gso_max_segs 65535
>   inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
>  valid_lft forever preferred_lft forever
>
> * In hostns:
>
>   # ip -d a
>   [...]
>   8: cilium_host@bond0:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default qlen 1000
>   link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 
> 68 maxmtu 65535
>   ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 
> gso_max_segs 65535
>   inet 10.41.0.1/32 scope link cilium_host
>  valid_lft forever preferred_lft forever
>   [...]
>
> One way to mitigate it is to check CAP_NET_ADMIN permissions of
> the ipvlan master device's ns, and only then allow to change
> mode or flags for all devices bound to it. Above two cases are
> then disallowed after the patch.
Thanks for the fix, Daniel.
>
> Signed-off-by: Daniel Borkmann 
Acked-by: Mahesh Bandewar 
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c 
> b/drivers/net/ipvlan/ipv
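A minimal sketch of the check described above (the helper name here is
illustrative, not necessarily the exact code in the patch):

/* Gate ipvlan mode/flag changes on CAP_NET_ADMIN relative to the
 * *master* device's namespace, so a userns-remapped container cannot
 * flip the mode for every slave bound to that master.
 */
static bool ipvlan_may_change_mode(struct net_device *master)
{
	/* dev_net() is the netns the master lives in; ns_capable() checks
	 * the caller's capability against that namespace's user_ns.
	 */
	return ns_capable(dev_net(master)->user_ns, CAP_NET_ADMIN);
}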

Re: Stack sends oversize UDP packet to the driver

2019-02-08 Thread महेश बंडेवार
On Wed, Feb 6, 2019 at 8:51 PM Mahesh Bandewar (महेश बंडेवार)
 wrote:
>
> On Tue, Feb 5, 2019 at 11:36 AM Michael Chan  
> wrote:
> >
> > On Wed, Jan 30, 2019 at 5:00 PM Mahesh Bandewar (महेश बंडेवार)
> >  wrote:
> > >
> > > On Wed, Jan 30, 2019 at 1:07 AM Michael Chan  
> > > wrote:
> > > >
> > > > On Tue, Jan 22, 2019 at 10:29 AM Mahesh Bandewar (महेश बंडेवार)
> > > >  wrote:
> > > >
> > > > >
> > > > > The idea behind the fix is very simple and it is to create a dst-only
> > > > > (unregistered) device with a very low MTU and use it instead of 'lo'
> > > > > while invalidating the dst. This would make it *not* forward packets
> > > > > to driver which might need fragmentation.
> > > > >
> > > >
> > > > We tested the 2 patches many times and including an overnight test.  I
> > > > can confirm that the oversize UDP packets are no longer seen with the
> > > > patches applied.  However, I don't see the blackhole xmit function
> > > > getting called to free the SKBs though.
> > > >
> > > Thanks for the confirmation Michael. The blackhole device mtu is
> > > really small, so I would assume the fragmentation code dropped those
> > > packets before calling the xmit function (in ip_fragment), you could
> > > verify that with icmp counters.
> > >
> >
> > I've looked at this a little more.  The blackhole_dev is not IFF_UP |
> > IFF_RUNNING, right?  May be that's why the packets are never getting
> > to the xmit function?
Yes, so I added those two flags and ended up writing a test module for
the device (which I will include when posting the patch series).
However, adding those flags is also not sufficient, since the qdisc is
initialized to noop_qdisc, so the qdisc enqueue will drop packets
before they ever hit ndo_start_xmit().

I have another version of the fix (with help from Eric) and this
should hit the .ndo_start_xmit() of the blackhole_dev. I'm adding
these flags during the setup and then calling dev_activate() to change
noop qdisc to null qdisc. Please give this patch set a try and let me
know if the blackhole_dev xmit path gets exercised in your test
scenario.
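
For reference, a rough sketch of the blackhole device idea discussed in
this thread (illustrative only, not the attached patches; the setup
helper and the exact flag/MTU choices are assumptions):

static netdev_tx_t blackhole_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Swallow anything that still reaches the xmit path. */
	kfree_skb(skb);
	return NETDEV_TX_OK;
}

static const struct net_device_ops blackhole_netdev_ops = {
	.ndo_start_xmit	= blackhole_xmit,
};

static void blackhole_dev_setup(struct net_device *dev)
{
	dev->netdev_ops = &blackhole_netdev_ops;
	dev->mtu = ETH_MIN_MTU;			/* very low MTU, so oversize frames never fit */
	dev->flags |= IFF_UP | IFF_RUNNING;	/* the flags Michael asked about */
	/* dev_activate(dev) is still needed afterwards so the default
	 * noop qdisc is replaced and enqueued skbs actually reach
	 * ndo_start_xmit().
	 */
}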


0001-loopback-create-blackhole-net-device-similar-to-loop.patch
Description: Binary data


0002-blackhole_netdev-use-blackhole_netdev-to-invalidate-.patch
Description: Binary data


0003-blackhole_dev-add-a-selftest.patch
Description: Binary data


Re: Stack sends oversize UDP packet to the driver

2019-02-06 Thread महेश बंडेवार
On Tue, Feb 5, 2019 at 11:36 AM Michael Chan  wrote:
>
> On Wed, Jan 30, 2019 at 5:00 PM Mahesh Bandewar (महेश बंडेवार)
>  wrote:
> >
> > On Wed, Jan 30, 2019 at 1:07 AM Michael Chan  
> > wrote:
> > >
> > > On Tue, Jan 22, 2019 at 10:29 AM Mahesh Bandewar (महेश बंडेवार)
> > >  wrote:
> > >
> > > >
> > > > The idea behind the fix is very simple and it is to create a dst-only
> > > > (unregistered) device with a very low MTU and use it instead of 'lo'
> > > > while invalidating the dst. This would make it *not* forward packets
> > > > to driver which might need fragmentation.
> > > >
> > >
> > > We tested the 2 patches many times and including an overnight test.  I
> > > can confirm that the oversize UDP packets are no longer seen with the
> > > patches applied.  However, I don't see the blackhole xmit function
> > > getting called to free the SKBs though.
> > >
> > Thanks for the confirmation Michael. The blackhole device mtu is
> > really small, so I would assume the fragmentation code dropped those
> > packets before calling the xmit function (in ip_fragment), you could
> > verify that with icmp counters.
> >
>
> I've looked at this a little more.  The blackhole_dev is not IFF_UP |
> IFF_RUNNING, right?  May be that's why the packets are never getting
> to the xmit function?
Yes, so I added those two flags and ended up writing a test module for
the device (which I will include when posting the patch series).
However, adding those flags is also not sufficient, since the qdisc is
initialized to noop_qdisc, so the qdisc enqueue will drop packets
before they ever hit ndo_start_xmit().


Re: Stack sends oversize UDP packet to the driver

2019-01-30 Thread महेश बंडेवार
On Wed, Jan 30, 2019 at 1:07 AM Michael Chan  wrote:
>
> On Tue, Jan 22, 2019 at 10:29 AM Mahesh Bandewar (महेश बंडेवार)
>  wrote:
>
> >
> > The idea behind the fix is very simple and it is to create a dst-only
> > (unregistered) device with a very low MTU and use it instead of 'lo'
> > while invalidating the dst. This would make it *not* forward packets
> > to driver which might need fragmentation.
> >
>
> We tested the 2 patches many times and including an overnight test.  I
> can confirm that the oversize UDP packets are no longer seen with the
> patches applied.  However, I don't see the blackhole xmit function
> getting called to free the SKBs though.
>
Thanks for the confirmation, Michael. The blackhole device MTU is
really small, so I would assume the fragmentation code dropped those
packets before calling the xmit function (in ip_fragment); you could
verify that with the ICMP counters.

> Thanks.


Re: Stack sends oversize UDP packet to the driver

2019-01-22 Thread महेश बंडेवार
On Mon, Jan 21, 2019 at 4:59 PM Michael Chan  wrote:
>
> On Mon, Jan 21, 2019 at 4:36 PM Daniel Axtens  wrote:
>
> > I hit a similar sounding issue on a bnx2x - see commit
> > 8914a595110a6eca69a5e275b323f5d09e18f4f9
> >
> > In that case, a GSO packet with gso_size too large for the firmware was
> > coming to the bnx2x driver from an ibmveth device via Open vSwitch. I
> > also toyed with a fix in the stack and ended up fixing just the driver.
>
> Hi Daniel, yes I remember the issue that you reported at that time.
> The current issue is similar but for non-TSO UDP packets.
>
> >
> > I was hoping to get a generic fix in to the stack afterwards, but didn't
> > get anything finished. Looking back at old branches, it looks like I
> > considered adding MTU validation to validate_xmit_skb,
>
> MTU validation in validate_xmit_skb() would definitely prevent the
> current reported issue.  Perhaps we can add a WARN() when an invalid
> length is detected so that the underlying issue will be reported and
> fixed.  The current issue is that somehow the dst of the SKB gets
> changed to "lo" and fragmentation is not done for the proper output
> device.  Thanks.

[Looks like my earlier reply didn't make it to most of the people on
the list, so let me include it here as well.]

>>>
We have been battling with something similar. The suspicion is that
when an egress packet gets handed from L4 to L3, some strange sequence
of events causes dst invalidation, which changes the dst device to 'lo'
(which has an MTU of 65k). This makes the stack pass a frame larger
than the real device MTU to the driver, which results in the undesired
outcome you have mentioned.

I have cooked up a few patches to address this issue; however, I'm
finding it really hard to reproduce the issue in my test environment.
If you could try them and see whether they fix the issue, we can refine
them if necessary and eventually formalize the patches.
<<<

Actually, while looking at this issue I did look at the fix Daniel
mentioned (8914a595110a6eca69a5e275b323f5d09e18f4f9) and wanted to
reach out to him to see whether the solution I'm trying to implement
in the stack works in that scenario too.

The idea behind the fix is very simple: create a dst-only
(unregistered) device with a very low MTU and use it instead of 'lo'
when invalidating the dst. This makes the stack *not* forward packets
that might need fragmentation to the driver.
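
A rough sketch of the dst side of that idea (assuming the dst-only
device from the first patch is available as 'blackhole_netdev'; this
shows the concept, not the exact patch):

/* When invalidating a dst, point it at the tiny-MTU blackhole device
 * instead of 'lo', so the stack can never build a near-64k frame
 * against the invalidated route.
 */
static void dst_switch_to_blackhole(struct dst_entry *dst)
{
	dev_put(dst->dev);		/* drop the ref on the old device */
	dst->dev = blackhole_netdev;	/* low-MTU, dst-only sink */
	dev_hold(dst->dev);
}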

Attaching the patches to this mail. [Let me know if I should provide
them in a different way]

thanks,
--mahesh..


0001-loopback-create-blackhole-net-device-similar-to-loop.patch
Description: Binary data


0002-blackhole_netdev-use-blackhole_netdev-to-invalidate-.patch
Description: Binary data


Re: [PATCH net 1/1] bonding: fix PACKET_ORIGDEV regression on bonding masters

2019-01-14 Thread महेश बंडेवार
On Mon, Jan 14, 2019 at 12:00 AM Vincent Bernat  wrote:
>
>  ❦ 13 janvier 2019 18:01 -08, Maciej Żenczykowski :
>
> > But I seem to recall that the core problem we were trying to solve was
> > that a daemon listening
> > on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
> > would not receive LLDP packets
> > arriving on inactive bond slaves (either active-backup or lag).
>
> Just tested and with 4.9.150, I am in fact unable to receive anything
> on a backup link when listening to the active-backup master device or to
> "any" device.
>
> > Perhaps going from:
> >   /* don't change skb->dev for link-local packets */
> >   if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return 
> > RX_HANDLER_PASS;
> >   if (bond_should_deliver_exact_match(skb, slave, bond)) return
> > RX_HANDLER_EXACT;
> >
> > to something more like:
> >   if (bond_should_deliver_exact_match(skb, slave, bond)) {
> > /* don't change skb->dev for link-local packets on inactive slaves */
> > if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return 
> > RX_HANDLER_PASS;
> > return RX_HANDLER_EXACT;
> >   }
> >
> > would fix both problems?
>
Thanks for jumping in and offering a solution. This should fix the issue.

NACK for the revert-patch!

Folks, please, a revert is not the solution! Last time a problem was
reported I offered you a solution; wasn't that enough to prove that we
care about solving the problems you are facing while keeping this
functionality? No one wants to break your use case; it only happens
because one is not aware of it. Thank you, David, for helping to
resolve it.


> It makes PACKET_ORIGDEV works again. Moreover, when not binding to any
> interface, we receive packets on both active and backup links. But when
> binding to the master device, I only receive packets from the active
> devices (which is the same behaviour than pre-4.12). When not binding to
> any device and not using PACKET_ORIGDEV, one packet is said to be from
> the master device and one packet is said to be from the backup device.
> Previously, I had one packet from active device, one packet from backup
> device and two packets from master device.
>
> For me, this is a better situation than previously as we return to the
> situation before 4.12 but you can get what you want by not binding to
> any device _and_ using PACKET_ORIGDEV (otherwise, you are don't get the
> right interface in all cases).
>
> If it's unclear, I can provide more extensive results. I am using this
> test program (comment the s.bind/s.setsockopt line if needed):
>
> #v+
> #!/usr/bin/env python3
>
> import sys
> import socket
> import datetime
>
> socket.SOL_PACKET = 263
> socket.PACKET_ORIGDEV = 9
>
> s = socket.socket(socket.AF_PACKET,
>   socket.SOCK_RAW,
>   socket.htons(0x88cc))
> s.bind(("bond0", 0))
> s.setsockopt(socket.SOL_PACKET, socket.PACKET_ORIGDEV, 1)
> while True:
> data, addrinfo = s.recvfrom(1500)
> if addrinfo[2] == socket.PACKET_OUTGOING:
> continue
> print(f"{datetime.datetime.now().isoformat()}: "
>   f"Received {len(data)} bytes from {addrinfo}")
> #v-
> --
> Localise input and output in subroutines.
> - The Elements of Programming Style (Kernighan & Plauger)


Re: [PATCH iproute2] libnetlink: fix leak and using unused memory on error

2018-09-14 Thread महेश बंडेवार
On Thu, Sep 13, 2018 at 12:33 PM, Stephen Hemminger
 wrote:
> If an error happens in multi-segment message (tc only)
> then report the error and stop processing further responses.
> This also fixes refering to the buffer after free.
>
> The sequence check is not necessary here because the
> response message has already been validated to be in
> the window of the sequence number of the iov.
>
> Reported-by: Mahesh Bandewar 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> Signed-off-by: Stephen Hemminger 
Acked-by: Mahesh Bandewar 
> ---
>  lib/libnetlink.c | 23 +--
>  1 file changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> index 928de1dd16d8..586809292594 100644
> --- a/lib/libnetlink.c
> +++ b/lib/libnetlink.c
> @@ -617,7 +617,6 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl, 
> struct iovec *iov,
> msg.msg_iovlen = 1;
> i = 0;
> while (1) {
> -next:
> status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
> ++i;
>
> @@ -660,27 +659,23 @@ next:
>
> if (l < sizeof(struct nlmsgerr)) {
> fprintf(stderr, "ERROR truncated\n");
> -   } else if (!err->error) {
> +   free(buf);
> +   return -1;
> +   }
> +
> +   if (!err->error)
> /* check messages from kernel */
> nl_dump_ext_ack(h, errfn);
>
> -   if (answer)
> -   *answer = (struct nlmsghdr 
> *)buf;
> -   else
> -   free(buf);
> -   if (h->nlmsg_seq == seq)
> -   return 0;
> -   else if (i < iovlen)
> -   goto next;
> -   return 0;
> -   }
> -
> if (rtnl->proto != NETLINK_SOCK_DIAG &&
> show_rtnl_err)
> rtnl_talk_error(h, err, errfn);
>
> errno = -err->error;
> -   free(buf);
> +   if (answer)
> +   *answer = (struct nlmsghdr *)buf;
> +   else
> +   free(buf);
> return -i;
> }
>
> --
> 2.18.0
>


Re: [PATCH net v2] bonding: pass link-local packets to bonding master also.

2018-09-13 Thread महेश बंडेवार
On Thu, Sep 13, 2018 at 4:00 PM, Michal Soltys  wrote:
> On 2018-07-19 18:20, Michal Soltys wrote:
>> On 07/19/2018 01:41 AM, Mahesh Bandewar wrote:
>>> From: Mahesh Bandewar 
>>>
>>> Commit b89f04c61efe ("bonding: deliver link-local packets with
>>> skb->dev set to link that packets arrived on") changed the behavior
>>> of how link-local-multicast packets are processed. The change in
>>> the behavior broke some legacy use cases where these packets are
>>> expected to arrive on bonding master device also.
>>>
>>> This patch passes the packet to the stack with the link it arrived
>>> on as well as passes to the bonding-master device to preserve the
>>> legacy use case.
>>>
>>> Fixes: b89f04c61efe ("bonding: deliver link-local packets with
>>> skb->dev set to link that packets arrived on")
>>> Reported-by: Michal Soltys 
>>> Signed-off-by: Mahesh Bandewar 
>>> ---
>>> v2: Added Fixes tag.
>>> v1: Initial patch.
>>>   drivers/net/bonding/bond_main.c | 17 +++--
>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c
>>> b/drivers/net/bonding/bond_main.c
>>> index 9a2ea3c1f949..1d3b7d8448f2 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -1177,9 +1177,22 @@ static rx_handler_result_t
>>> bond_handle_frame(struct sk_buff **pskb)
>>>   }
>>>   }
>>>   -/* don't change skb->dev for link-local packets */
>>> -if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>>> +/* Link-local multicast packets should be passed to the
>>> + * stack on the link they arrive as well as pass them to the
>>> + * bond-master device. These packets are mostly usable when
>>> + * stack receives it with the link on which they arrive
>>> + * (e.g. LLDP) but there may be some legacy behavior that
>>> + * expects these packets to appear on bonding master too.
>>
>> I'd really change the comment from:
>>
>> "These packets are mostly usable when stack receives it with the link on
>> which they arrive (e.g. LLDP) but there may be some legacy behavior that
>> expects these packets to appear on bonding master too."
>>
>> to something like:
>>
>> "These packets are mostly usable when stack receives it with the link on
>> which they arrive, but they also must be available on aggregations. Some
>> of the use cases include (but are not limited to): LLDP agents that must
>> be able to operate both on enslaved interfaces as well as on bonds
>> themselves; linux bridges that must be able to process/pass BPDUs from
>> attached bonds when any kind of stp version is enabled on the network."
>>
>> It's a bit longer, but clarifies the reasons more precisely (without
>> going too deep into features like group_fwd_mask).
>>
>
> Anyway, any chance for that patch to get merged ? It would be great to
> get the correct functionality back asap. As for the comment, I'll submit
> a trivial patch expanding/clarifying it later (or I can resubmit
> adjusted v3 if it's ok with Mahesh).
Hmm, I didn't notice that it wasn't merged; somehow it fell through
the cracks when it needed my attention earlier. I'll resubmit.


Re: [PATCH iproute2] iproute2: fix use-after-free

2018-09-13 Thread महेश बंडेवार
On Thu, Sep 13, 2018 at 8:19 AM, Stephen Hemminger
 wrote:
>
> On Wed, 12 Sep 2018 23:07:20 -0700
> Mahesh Bandewar (महेश बंडेवार)  wrote:
>
> > On Wed, Sep 12, 2018 at 5:33 PM, Stephen Hemminger
> >  wrote:
> > >
> > > On Wed, 12 Sep 2018 16:29:28 -0700
> > > Mahesh Bandewar  wrote:
> > >
> > > > From: Mahesh Bandewar 
> > > >
> > > > A local program using iproute2 lib pointed out the issue and looking
> > > > at the code it is pretty obvious -
> > > >
> > > > a = (struct nlmsghdr *)b;
> > > > ...
> > > > free(b);
> > > > if (a->nlmsg_seq == seq)
> > > > ...
> > > >
> > > > Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support 
> > > > malloc buff at run time")
> > > > Signed-off-by: Mahesh Bandewar 
> > >
> > > Yes, this is a real problem.
> > >
> > > Maybe a minimal patch like this would be enough:
> > actually that will leave the memory leak at the 'goto next' line (just
> > few lines below) since that jump will overwrite the buf.
>
> It looks like everytime in the while loop. a new buffer is allocated.
> So yes, it looks like old, my patch, and your patch would leak there
> was an error followed by other data in response.
> Though I doubt kernel would ever do that.
>
I started fixing the issue that I reported and then found the memory
leak as well, hence my first attempt at a simple fix addressed the
use-after-free as well as the memory leak (in my patch). I'm not going
to claim that I know how and where this gets used, but my attempt was
simply to fix those two issues. I don't mind which fix you apply as
long as these issues get addressed.

> The only user of iov style messages to the kernel is in tc batching.
> My gut feeling is that if one message in batch has error, then
> the netlink code should return that error and stop processing more.


Re: [PATCH iproute2] iproute2: fix use-after-free

2018-09-12 Thread महेश बंडेवार
On Wed, Sep 12, 2018 at 5:33 PM, Stephen Hemminger
 wrote:
>
> On Wed, 12 Sep 2018 16:29:28 -0700
> Mahesh Bandewar  wrote:
>
> > From: Mahesh Bandewar 
> >
> > A local program using iproute2 lib pointed out the issue and looking
> > at the code it is pretty obvious -
> >
> > a = (struct nlmsghdr *)b;
> > ...
> > free(b);
> > if (a->nlmsg_seq == seq)
> > ...
> >
> > Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support malloc 
> > buff at run time")
> > Signed-off-by: Mahesh Bandewar 
>
> Yes, this is a real problem.
>
> Maybe a minimal patch like this would be enough:
Actually that will leave the memory leak at the 'goto next' line (just
a few lines below), since that jump will overwrite the buf.

> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> index 928de1dd16d8..ab2d8452e4a1 100644
> --- a/lib/libnetlink.c
> +++ b/lib/libnetlink.c
> @@ -661,6 +661,8 @@ next:
> if (l < sizeof(struct nlmsgerr)) {
> fprintf(stderr, "ERROR truncated\n");
> } else if (!err->error) {
> +   __u32 err_seq = h->nlmsg_seq;
> +
> /* check messages from kernel */
> nl_dump_ext_ack(h, errfn);
>
> @@ -668,7 +670,8 @@ next:
> *answer = (struct nlmsghdr 
> *)buf;
> else
> free(buf);
> -   if (h->nlmsg_seq == seq)
> +
> +   if (err_seq == seq)
> return 0;
> else if (i < iovlen)
> goto next;
>


Re: [PATCHv2 iproute2 2/3] tc: remove extern from prototype declarations

2018-08-21 Thread महेश बंडेवार
On Tue, Aug 21, 2018 at 11:19 AM, Stephen Hemminger
 wrote:
> On Tue, 21 Aug 2018 10:48:54 -0700
> Mahesh Bandewar  wrote:
>
>> From: Mahesh Bandewar 
>>
>> Signed-off-by: Mahesh Bandewar 
>
> I already did this yesterday. Patch was on mailing list.
Hmm, I thought I did mention that I would take care of it in v2. In
any case, we can drop this from the patch series if the remaining
patches are fine. Does that make sense?


Re: [PATCH iproute2] iproute: make clang happy

2018-08-21 Thread महेश बंडेवार
On Mon, Aug 20, 2018 at 5:52 PM, Stephen Hemminger
 wrote:
> On Mon, 20 Aug 2018 16:44:28 -0700
> Mahesh Bandewar (महेश बंडेवार)  wrote:
>
>> On Mon, Aug 20, 2018 at 4:38 PM, Mahesh Bandewar (महेश बंडेवार)
>>  wrote:
>> > On Mon, Aug 20, 2018 at 3:52 PM, Stephen Hemminger
>> >  wrote:
>> >> On Mon, 20 Aug 2018 14:42:15 -0700
>> >> Mahesh Bandewar  wrote:
>> >>
>> >>> diff --git a/tc/m_ematch.c b/tc/m_ematch.c
>> >>> index ace4b3dd738b..a524b520b276 100644
>> >>> --- a/tc/m_ematch.c
>> >>> +++ b/tc/m_ematch.c
>> >>> @@ -277,6 +277,7 @@ static int flatten_tree(struct ematch *head, struct 
>> >>> ematch *tree)
>> >>>   return count;
>> >>>  }
>> >>>
>> >>> +__attribute__((format(printf, 5, 6)))
>> >>>  int em_parse_error(int err, struct bstr *args, struct bstr *carg,
>> >>>  struct ematch_util *e, char *fmt, ...)
>> >>
>> >> I think the printf attribute needs to go on the function prototype
>> >> here:
>> >> tc/m_ematch.h:extern int em_parse_error(int err, struct bstr *args, 
>> >> struct bstr *carg,
>> >>
>> > The attributes are attached to the definitions only and not prototype
>> > declarations. Please see the definition/declaration for jsonw_printf()
>> > in the same patch.
>> I take that back. Seems like it's fine either way.
>
> The reason to put the attributes in the .h file is that then the compiler
> can test for misuse in other files.  For example if em_parse_error had
> a bad format in em_u32.c, then the warning would not happen unless
> the attribute was on the function prototype.
>
Correct, I will take care of that in v2.
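
For reference, that amounts to putting the attribute on the prototype
in tc/m_ematch.h, roughly like this (a sketch; the prototype is the one
quoted above):

__attribute__((format(printf, 5, 6)))
int em_parse_error(int err, struct bstr *args, struct bstr *carg,
		   struct ematch_util *e, char *fmt, ...);

/* With the attribute visible in the header, callers in other files
 * (e.g. em_u32.c) get format-string checking as well.
 */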


Re: [PATCH iproute2] iproute: make clang happy

2018-08-20 Thread महेश बंडेवार
On Mon, Aug 20, 2018 at 4:44 PM, Mahesh Bandewar (महेश बंडेवार)
 wrote:
> On Mon, Aug 20, 2018 at 4:38 PM, Mahesh Bandewar (महेश बंडेवार)
>  wrote:
>> On Mon, Aug 20, 2018 at 3:52 PM, Stephen Hemminger
>>  wrote:
>>> On Mon, 20 Aug 2018 14:42:15 -0700
>>> Mahesh Bandewar  wrote:
>>>
>>>> diff --git a/tc/m_ematch.c b/tc/m_ematch.c
>>>> index ace4b3dd738b..a524b520b276 100644
>>>> --- a/tc/m_ematch.c
>>>> +++ b/tc/m_ematch.c
>>>> @@ -277,6 +277,7 @@ static int flatten_tree(struct ematch *head, struct 
>>>> ematch *tree)
>>>>   return count;
>>>>  }
>>>>
>>>> +__attribute__((format(printf, 5, 6)))
>>>>  int em_parse_error(int err, struct bstr *args, struct bstr *carg,
>>>>  struct ematch_util *e, char *fmt, ...)
>>>
>>> I think the printf attribute needs to go on the function prototype
>>> here:
>>> tc/m_ematch.h:extern int em_parse_error(int err, struct bstr *args, struct 
>>> bstr *carg,
>>>
>> The attributes are attached to the definitions only and not prototype
>> declarations. Please see the definition/declaration for jsonw_printf()
>> in the same patch.
> I take that back. Seems like it's fine either way.
>>>
>>> PS: I need to take the extern of  those function prototypes.
I can take care of this in v2


Re: [PATCH iproute2] iproute: make clang happy

2018-08-20 Thread महेश बंडेवार
On Mon, Aug 20, 2018 at 3:50 PM, Stephen Hemminger
 wrote:
> On Mon, 20 Aug 2018 14:42:15 -0700
> Mahesh Bandewar  wrote:
>
>>
>>   if (is_json_context()) {
>> + json_writer_t *jw;
>> +
>>   open_json_object("bittiming");
>>   print_int(PRINT_ANY, "bitrate", NULL, bt->bitrate);
>> - jsonw_float_field_fmt(get_json_writer(),
>> -   "sample_point", "%.3f",
>> -   (float) bt->sample_point / 
>> 1000.);
>> + jw = get_json_writer();
>> + jsonw_name(jw, "sample_point");
>> + jsonw_printf(jw, "%.3f",
>> +  (float) bt->sample_point / 1000);
>
> I think it would be better to get rid of the is_json_context() here in  the 
> CAN code
> and just use the print_json functions completely.  Most of the other code is 
> able to
> do that already.
Seems like this is not the only location; it needs to be taken care of
everywhere in the CAN code. I'll leave that to a JSON / print expert.


Re: [PATCH iproute2] iproute: make clang happy

2018-08-20 Thread महेश बंडेवार
On Mon, Aug 20, 2018 at 4:38 PM, Mahesh Bandewar (महेश बंडेवार)
 wrote:
> On Mon, Aug 20, 2018 at 3:52 PM, Stephen Hemminger
>  wrote:
>> On Mon, 20 Aug 2018 14:42:15 -0700
>> Mahesh Bandewar  wrote:
>>
>>> diff --git a/tc/m_ematch.c b/tc/m_ematch.c
>>> index ace4b3dd738b..a524b520b276 100644
>>> --- a/tc/m_ematch.c
>>> +++ b/tc/m_ematch.c
>>> @@ -277,6 +277,7 @@ static int flatten_tree(struct ematch *head, struct 
>>> ematch *tree)
>>>   return count;
>>>  }
>>>
>>> +__attribute__((format(printf, 5, 6)))
>>>  int em_parse_error(int err, struct bstr *args, struct bstr *carg,
>>>  struct ematch_util *e, char *fmt, ...)
>>
>> I think the printf attribute needs to go on the function prototype
>> here:
>> tc/m_ematch.h:extern int em_parse_error(int err, struct bstr *args, struct 
>> bstr *carg,
>>
> The attributes are attached to the definitions only and not prototype
> declarations. Please see the definition/declaration for jsonw_printf()
> in the same patch.
I take that back. Seems like it's fine either way.
>>
>> PS: I need to take the extern of  those function prototypes.


Re: [PATCH iproute2] iproute: make clang happy

2018-08-20 Thread महेश बंडेवार
On Mon, Aug 20, 2018 at 3:52 PM, Stephen Hemminger
 wrote:
> On Mon, 20 Aug 2018 14:42:15 -0700
> Mahesh Bandewar  wrote:
>
>> diff --git a/tc/m_ematch.c b/tc/m_ematch.c
>> index ace4b3dd738b..a524b520b276 100644
>> --- a/tc/m_ematch.c
>> +++ b/tc/m_ematch.c
>> @@ -277,6 +277,7 @@ static int flatten_tree(struct ematch *head, struct 
>> ematch *tree)
>>   return count;
>>  }
>>
>> +__attribute__((format(printf, 5, 6)))
>>  int em_parse_error(int err, struct bstr *args, struct bstr *carg,
>>  struct ematch_util *e, char *fmt, ...)
>
> I think the printf attribute needs to go on the function prototype
> here:
> tc/m_ematch.h:extern int em_parse_error(int err, struct bstr *args, struct 
> bstr *carg,
>
The attributes are attached to the definitions only and not prototype
declarations. Please see the definition/declaration for jsonw_printf()
in the same patch.
>
> PS: I need to take the extern of  those function prototypes.


Re: [PATCH iproute2] ipmaddr: use preferred_family when given

2018-08-17 Thread महेश बंडेवार
On Fri, Aug 17, 2018 at 9:29 AM, Stephen Hemminger
 wrote:
> On Wed, 15 Aug 2018 16:08:55 -0700
> Mahesh Bandewar  wrote:
>
>> From: Mahesh Bandewar 
>>
>> When creating socket() AF_INET is used irrespective of the family
>> that is given at the command-line (with -4, -6, or -0). This change
>> will open the socket with the preferred family.
>>
>> Signed-off-by: Mahesh Bandewar 
>> ---
>>  ip/ipmaddr.c | 13 -
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> What is impact of this? Does ip multicast address changes not work on IPv6?
> Or is it just doing the right thing?
Essentially a no-op. Just doing the right thing. :)


Re: [PATCH next v2] bonding: pass link-local packets to bonding master also.

2018-07-18 Thread महेश बंडेवार
On Wed, Jul 18, 2018 at 4:33 PM, David Miller  wrote:
> From: Mahesh Bandewar (महेश बंडेवार) 
> Date: Wed, 18 Jul 2018 16:19:17 -0700
>
>> On Wed, Jul 18, 2018 at 1:18 PM, David Miller  wrote:
>>> From: Mahesh Bandewar 
>>> Date: Wed, 18 Jul 2018 12:55:42 -0700
>>>
>>>> From: Mahesh Bandewar 
>>>>
>>>> Commit b89f04c61efe ("bonding: deliver link-local packets with
>>>> skb->dev set to link that packets arrived on") changed the behavior
>>>> of how link-local-multicast packets are processed. The change in
>>>> the behavior broke some legacy use cases where these packets are
>>>> expected to arrive on bonding master device also.
>>>>
>>>> This patch passes the packet to the stack with the link it arrived
>>>> on as well as passes to the bonding-master device to preserve the
>>>> legacy use case.
>>>>
>>>> Fixes: b89f04c61efe ("bonding: deliver link-local packets with skb->dev 
>>>> set to link that packets arrived on")
>>>> Reported-by: Michal Soltys 
>>>> Signed-off-by: Mahesh Bandewar 
>>>
>>> If this is a regression, it should target 'net' rather than 'net-next'
>>> so we can queue it up for -stable as well.
>>>
>> Yes, it is. Just forgot to revise the subject. Do you want me to
>> resend the patch?
>
> Yes, please.
sure, will do.


Re: [PATCH next v2] bonding: pass link-local packets to bonding master also.

2018-07-18 Thread महेश बंडेवार
On Wed, Jul 18, 2018 at 1:18 PM, David Miller  wrote:
> From: Mahesh Bandewar 
> Date: Wed, 18 Jul 2018 12:55:42 -0700
>
>> From: Mahesh Bandewar 
>>
>> Commit b89f04c61efe ("bonding: deliver link-local packets with
>> skb->dev set to link that packets arrived on") changed the behavior
>> of how link-local-multicast packets are processed. The change in
>> the behavior broke some legacy use cases where these packets are
>> expected to arrive on bonding master device also.
>>
>> This patch passes the packet to the stack with the link it arrived
>> on as well as passes to the bonding-master device to preserve the
>> legacy use case.
>>
>> Fixes: b89f04c61efe ("bonding: deliver link-local packets with skb->dev set 
>> to link that packets arrived on")
>> Reported-by: Michal Soltys 
>> Signed-off-by: Mahesh Bandewar 
>
> If this is a regression, it should target 'net' rather than 'net-next'
> so we can queue it up for -stable as well.
>
Yes, it is. Just forgot to revise the subject. Do you want me to
resend the patch?

thanks,
--mahesh..

> Thank you.


Re: [PATCH next] bonding: pass link-local packets to bonding master also.

2018-07-16 Thread महेश बंडेवार
On Mon, Jul 16, 2018 at 4:33 PM, Stephen Hemminger
 wrote:
> On Sun, 15 Jul 2018 18:12:46 -0700
> Mahesh Bandewar  wrote:
>
>> From: Mahesh Bandewar 
>>
>> Commit b89f04c61efe ("bonding: deliver link-local packets with
>> skb->dev set to link that packets arrived on") changed the behavior
>> of how link-local-multicast packets are processed. The change in
>> the behavior broke some legacy use cases where these packets are
>> expected to arrive on bonding master device also.
>>
>> This patch passes the packet to the stack with the link it arrived
>> on as well as passes to the bonding-master device to preserve the
>> legacy use case.
>>
>> Reported-by: Michal Soltys 
>> Signed-off-by: Mahesh Bandewar 
>
> Thanks for fixing this.
>
> Why not add a Fixes: tag instead of just talking about the commit?
> That helps the stable maintainers know which versions of the kernel
> need the patch.
Well, I thought about it. It's definitely 'related', but I'm not sure
it 'fixes' in the true sense. It definitely fixes the broken legacy
case, though. Is that sufficient to add a 'Fixes' tag?


Re: [PATCH next] bonding: pass link-local packets to bonding master also.

2018-07-16 Thread महेश बंडेवार
On Mon, Jul 16, 2018 at 2:24 PM, Jay Vosburgh
 wrote:
> Mahesh Bandewar  wrote:
>
>>From: Mahesh Bandewar 
>>
>>Commit b89f04c61efe ("bonding: deliver link-local packets with
>>skb->dev set to link that packets arrived on") changed the behavior
>>of how link-local-multicast packets are processed. The change in
>>the behavior broke some legacy use cases where these packets are
>>expected to arrive on bonding master device also.
>>
>>This patch passes the packet to the stack with the link it arrived
>>on as well as passes to the bonding-master device to preserve the
>>legacy use case.
>
> Michal, can you test this?  I'm travelling this week and won't
> be able to run the patch.
>
> Mahesh, will this confuse LLDP, et al, daemons that, e.g., bind
> to every possible interface and now see the same LLDP PDU (identical
> Chassis ID, Port ID, et al, TLVs) on multiple interfaces?
>
Well, it's hard to say. In the previous world, when these packets
appeared only on the bonding master, such a daemon had to go to extra
lengths to figure out which link they actually came in on. With the
earlier change (SHA1: b89f04c61efe) it didn't have to, but with this
patch the best thing it can do is simply ignore packets coming from
(any) virtual devices. The only reason I'm OK with this change is that
the L2 of a physical link is shared with the virtual link (bonding
master), so both links receiving the same link-local multicast seems
acceptable. Making them appear only on the bonding master is just
wrong, while correcting that behavior breaks the legacy use case, and
here we are.

BTW, when links are aggregated using LACP, these packets don't arrive
with the system MAC but with the real MAC of the sending port, and with
a multicast destination MAC.

--mahesh..

> Thanks,
>
> -J
>
>>Reported-by: Michal Soltys 
>>Signed-off-by: Mahesh Bandewar 
>>---
>> drivers/net/bonding/bond_main.c | 17 +++--
>> 1 file changed, 15 insertions(+), 2 deletions(-)
>>
>>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>index 9a2ea3c1f949..1d3b7d8448f2 100644
>>--- a/drivers/net/bonding/bond_main.c
>>+++ b/drivers/net/bonding/bond_main.c
>>@@ -1177,9 +1177,22 @@ static rx_handler_result_t bond_handle_frame(struct 
>>sk_buff **pskb)
>>   }
>>   }
>>
>>-  /* don't change skb->dev for link-local packets */
>>-  if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>>+  /* Link-local multicast packets should be passed to the
>>+   * stack on the link they arrive as well as pass them to the
>>+   * bond-master device. These packets are mostly usable when
>>+   * stack receives it with the link on which they arrive
>>+   * (e.g. LLDP) but there may be some legacy behavior that
>>+   * expects these packets to appear on bonding master too.
>>+   */
>>+  if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) {
>>+  struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>>+
>>+  if (nskb) {
>>+  nskb->dev = bond->dev;
>>+  netif_rx(nskb);
>>+  }
>>   return RX_HANDLER_PASS;
>>+  }
>>   if (bond_should_deliver_exact_match(skb, slave, bond))
>>   return RX_HANDLER_EXACT;
>>
>>--
>>2.18.0.203.gfac676dfb9-goog
>
> ---
> -Jay Vosburgh, jay.vosbu...@canonical.com


Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-12 Thread महेश बंडेवार
On Thu, Jul 12, 2018 at 4:14 PM, Michal Soltys  wrote:
> On 2018-07-13 00:03, Jay Vosburgh wrote:
>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>
>>>On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh
>>> wrote:
>>>> Michal Soltys  wrote:
>>>>
>>>>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote:
>>>>>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>>>>>
>>>>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> As weird as that sounds, this is what I observed today after bumping
>>>>>>>> kernel version. I have a setup where 2 bonds are attached to linux
>>>>>>>> bridge and physically are connected to two switches doing MSTP (and
>>>>>>>> linux bridge is just passing them).
>>>>>>>>
>>>>>>>> Initially I suspected some changes related to bridge code - but quick
>>>>>>>> peek at the code showed nothing suspicious - and the part of it that
>>>>>>>> explicitly passes stp frames if stp is not enabled has seen little
>>>>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
>>>>>>>> regular non-bonded interfaces are attached everything works fine.
>>>>>>>>
>>>>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with
>>>>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>>>>>>>> there (with them being present just fine on active enslaved interface,
>>>>>>>> or on the bond device in earlier kernels).
>>>>>>>>
>>>>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from
>>>>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
>>>>>>>> debian) and 4.17.3 (tested on archlinux) are failing.
>>>>>>>>
>>>>>>>> Unless this is already a known issue (or you have any suggestions what
>>>>>>>> could be responsible).
>>>>>>>>
>>>>>>> I believe these are link-local-multicast messages and sometime back a
>>>>>>> change went into to not pass those frames to the bonding master. This
>>>>>>> could be the side effect of that.
>>>>>>
>>>>>>  Mahesh, I suspect you're thinking of:
>>>>>>
>>>>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e
>>>>>> Author: Chonggang Li 
>>>>>> Date:   Sun Apr 16 12:02:18 2017 -0700
>>>>>>
>>>>>>  bonding: deliver link-local packets with skb->dev set to link that 
>>>>>> packets arrived on
>>>>>>
>>>>>>  Michal, are you able to revert this patch and test?
>>>>>>
>>>>>>  -J
>>>>>>
>>>>>> ---
>>>>>>  -Jay Vosburgh, jay.vosbu...@canonical.com
>>>>>>
>>>>>
>>>>>
>>>>>Just tested - yes, reverting that patch solves the issues.
>>>>
>>>> Chonggang,
>>>>
>>>> Reading the changelog in your commit referenced above, I'm not
>>>> entirely sure what actual problem it is fixing.  Could you elaborate?
>>>>
>>>> As the patch appears to cause a regression, it needs to be
>>>> either fixed or reverted.
>>>>
>>>> Mahesh, you signed-off on it as well, perhaps you also have some
>>>> context?
>>>>
>>>
>>>I think the original idea behind it was to pass the LLDPDUs to the
>>>stack on the interface that they came on since this is considered to
>>>be link-local traffic and passing to bond-master would loose it's
>>>"linklocal-ness". This is true for LLDP and if you change the skb->dev
>>>of the packet, then you don't know which slave link it came on in
>>>(from LLDP consumer's perspective).
>>>
>>>I don't know much about STP but trunking two links and aggregating
>>>this link info through bond-master seems wrong. Just like LLDP, you
>>>are losing info specific to a link and the decision derived from that
>>&

Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-12 Thread महेश बंडेवार
On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh
 wrote:
> Michal Soltys  wrote:
>
>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote:
>>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>>
>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> As weird as that sounds, this is what I observed today after bumping
>>>>> kernel version. I have a setup where 2 bonds are attached to linux
>>>>> bridge and physically are connected to two switches doing MSTP (and
>>>>> linux bridge is just passing them).
>>>>>
>>>>> Initially I suspected some changes related to bridge code - but quick
>>>>> peek at the code showed nothing suspicious - and the part of it that
>>>>> explicitly passes stp frames if stp is not enabled has seen little
>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
>>>>> regular non-bonded interfaces are attached everything works fine.
>>>>>
>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with
>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>>>>> there (with them being present just fine on active enslaved interface,
>>>>> or on the bond device in earlier kernels).
>>>>>
>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from
>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
>>>>> debian) and 4.17.3 (tested on archlinux) are failing.
>>>>>
>>>>> Unless this is already a known issue (or you have any suggestions what
>>>>> could be responsible).
>>>>>
>>>> I believe these are link-local-multicast messages and sometime back a
>>>> change went into to not pass those frames to the bonding master. This
>>>> could be the side effect of that.
>>>
>>>  Mahesh, I suspect you're thinking of:
>>>
>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e
>>> Author: Chonggang Li 
>>> Date:   Sun Apr 16 12:02:18 2017 -0700
>>>
>>>  bonding: deliver link-local packets with skb->dev set to link that 
>>> packets arrived on
>>>
>>>  Michal, are you able to revert this patch and test?
>>>
>>>  -J
>>>
>>> ---
>>>  -Jay Vosburgh, jay.vosbu...@canonical.com
>>>
>>
>>
>>Just tested - yes, reverting that patch solves the issues.
>
> Chonggang,
>
> Reading the changelog in your commit referenced above, I'm not
> entirely sure what actual problem it is fixing.  Could you elaborate?
>
> As the patch appears to cause a regression, it needs to be
> either fixed or reverted.
>
> Mahesh, you signed-off on it as well, perhaps you also have some
> context?
>

I think the original idea behind it was to pass the LLDPDUs to the
stack on the interface that they came in on, since this is considered
to be link-local traffic and passing it to the bond master would lose
its "linklocal-ness". This is true for LLDP: if you change the skb->dev
of the packet, then you don't know which slave link it came in on
(from the LLDP consumer's perspective).

I don't know much about STP, but trunking two links and aggregating
this link info through the bond master seems wrong. Just like with
LLDP, you are losing info specific to a link, and the decision derived
from that info could be wrong.

Having said that, we determine "linklocal-ness" by looking at L2, and
the bond master shares this with its slaves. So it does seem fair to
pass those frames to the bonding master, but at the same time
link-local traffic is supposed to be limited to the physical link
(LLDP/STP/LACP etc.). Your thoughts?


> -J
>
> ---
> -Jay Vosburgh, jay.vosbu...@canonical.com


Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-11 Thread महेश बंडेवार
On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:
>
> Hi,
>
> As weird as that sounds, this is what I observed today after bumping
> kernel version. I have a setup where 2 bonds are attached to linux
> bridge and physically are connected to two switches doing MSTP (and
> linux bridge is just passing them).
>
> Initially I suspected some changes related to bridge code - but quick
> peek at the code showed nothing suspicious - and the part of it that
> explicitly passes stp frames if stp is not enabled has seen little
> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
> regular non-bonded interfaces are attached everything works fine.
>
> Just to be sure I detached the bond (802.3ad mode) and checked it with
> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
> there (with them being present just fine on active enslaved interface,
> or on the bond device in earlier kernels).
>
> If time permits I'll bisect tommorow to pinpoint the commit, but from
> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
> debian) and 4.17.3 (tested on archlinux) are failing.
>
> Unless this is already a known issue (or you have any suggestions what
> could be responsible).
>
I believe these are link-local multicast messages, and some time back
a change went in to not pass those frames to the bonding master. This
could be a side effect of that.


Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-15 Thread महेश बंडेवार
On Wed, Mar 14, 2018 at 8:37 PM, Alexei Starovoitov
 wrote:
> On Wed, Mar 14, 2018 at 05:17:54PM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
>> > On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
>> >  wrote:
>> >>
>> >>> It seems this is exactly the case where a netns would be the correct 
>> >>> answer.
>> >>
>> >> Unfortuantely that's not the case. That's what I tried to explain
>> >> in the cover letter:
>> >> "The setup involves per-container IPs, policy, etc, so traditional
>> >> network-only solutions that involve VRFs, netns, acls are not applicable."
>> >> To elaborate more on that:
>> >> netns is l2 isolation.
>> >> vrf is l3 isolation.
>> >> whereas to containerize an application we need to punch connectivity holes
>> >> in these layered techniques.
>> >> We also considered resurrecting Hannes's afnetns work
>> >> and even went as far as designing a new namespace for L4 isolation.
>> >> Unfortunately all hierarchical namespace abstraction don't work.
>> >> To run an application inside cgroup container that was not written
>> >> with containers in mind we have to make an illusion of running
>> >> in non-containerized environment.
>> >> In some cases we remember the port and container id in the post-bind hook
>> >> in a bpf map and when some other task in a different container is trying
>> >> to connect to a service we need to know where this service is running.
>> >> It can be remote and can be local. Both client and service may or may not
>> >> be written with containers in mind and this sockaddr rewrite is providing
>> >> connectivity and load balancing feature that you simply cannot do
>> >> with hierarchical networking primitives.
>> >
>> > have to explain this a bit further...
>> > We also considered hacking these 'connectivity holes' in
>> > netns and/or vrf, but that would be real layering violation,
>> > since clean l2, l3 abstraction would suddenly support
>> > something that breaks through the layers.
>> > Just like many consider ipvlan a bad hack that punches
>> > through the layers and connects l2 abstraction of netns
>> > at l3 layer, this is not something kernel should ever do.
>> > We really didn't want another ipvlan-like hack in the kernel.
>> > Instead bpf programs at bind/connect time _help_
>> > applications discover and connect to each other.
>> > All containers are running in init_nens and there are no vrfs.
>> > After bind/connect the normal fib/neighbor core networking
>> > logic works as it should always do. The whole system is
>> > clean from network point of view.
>>
>>
>> We apparently missed something when deploying ipvlan and one netns per
>> container/job
>
> Hanness expressed the reasons why RHEL doesn't support ipvlan long ago.

I had a long discussion with Hannes and there are two pending issues
(discounting minor bug fixes / improvements): (a) multicast group
membership and (b) early demux. Multicast group membership is just a
matter of putting some code there to fix it, while early demux is a
little harder to handle without violating isolation boundaries. To me
isolation is critical / important, and if we find the right solution
that doesn't violate isolation, we'd fix it.

> I couldn't find the complete link. This one mentions some of the issues:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html
> Since ipvlan works for you, great, but it's clearly a layering violation.
> ipvlan connects L2 namespaces via L3 by doing its own fib lookups.
> To me it's a definition 'punch connectivity hole' in L2 abstraction.
> In normal L2 setup of netns+veth the traffic from one netns should
> have went into another netns via full L2. ipvlan cheats by giving
> L3 connectivity. It's not clean to me.

IPvlan supports three different modes, and you have mixed all of them
while explaining your understanding of IPvlan. One probably needs to
digest all these modes and evaluate them in the context of their use
case. Well, I'm not even going to attempt to explain the differences;
if you were serious, you could have figured them out.

There are lots of use cases and people use it in interesting ways. Each
case can be better handled by using VRF, or macvlan, or IPvlan, or
whatever else is out there. It would be childish to say one use case is
better than the others, as these are *different* use cases. All these
solutions come with their own caveats, and you choose what you can live
with. Well, you can always improve, and I can see the Red Hat folks are
doing that, and I appreciate their efforts.

Like I said, there are several different ways to make this work with
namespaces in a much cleaner way, and IPvlan does not need to be
involved. However, adding another eBPF hook in a hackish way just
because we can is *not* the right way. Especially when the problem has
already been solved (with namespaces), these 2000 lines don't deserve
to be in the kernel. eBPF is a good tool and there is a thin line
between using it appropriately and misusing it. I don't want to argue

Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks

2018-03-14 Thread महेश बंडेवार
On Tue, Mar 13, 2018 at 8:39 PM, Alexei Starovoitov  wrote:
> For our container management we've been using complicated and fragile setup
> consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> all containerized applications.
> The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable.
You can keep the policies per cgroup but move the IP from the cgroup
to a netns, and then none of these eBPF hacks are required. Since
cgroups and namespaces are orthogonal, you can use cgroups in
conjunction with namespaces.

> Changing apps is not possible and LD_PRELOAD doesn't work
> for apps that don't use glibc like java and golang.
> BPF+cgroup looks to be the best solution for this problem.
> Hence we introduce 3 hooks:
> - at entry into sys_bind and sys_connect
>   to let bpf prog look and modify 'struct sockaddr' provided
>   by user space and fail bind/connect when appropriate
> - post sys_bind after port is allocated
>
> The approach works great and has zero overhead for anyone who doesn't
> use it and very low overhead when deployed.
>
> The main question for Daniel and Dave is what approach to take
> with prog types...
>
> In this patch set we introduce 6 new program types to make user
> experience easier:
>   BPF_PROG_TYPE_CGROUP_INET4_BIND,
>   BPF_PROG_TYPE_CGROUP_INET6_BIND,
>   BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
>   BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
>   BPF_PROG_TYPE_CGROUP_INET4_POST_BIND,
>   BPF_PROG_TYPE_CGROUP_INET6_POST_BIND,
>
> since v4 programs should not be using 'struct bpf_sock_addr'->user_ip6 fields
> and different prog type for v4 and v6 helps verifier reject such access
> at load time.
> Similarly bind vs connect are two different prog types too,
> since only sys_connect programs can call new bpf_bind() helper.
>
> This approach is very different from tcp-bpf where single
> 'struct bpf_sock_ops' and single prog type is used for different hooks.
> The field checks are done at run-time instead of load time.
>
> I think the approach taken by this patch set is justified,
> but we may do better if we extend BPF_PROG_ATTACH cmd
> with log_buf + log_size, then we should be able to combine
> bind+connect+v4+v6 into single program type.
> The idea that at load time the verifier will remember a bitmask
> of fields in bpf_sock_addr used by the program and helpers
> that program used, then at attach time we can check that
> hook is compatible with features used by the program and
> report human readable error message back via log_buf.
> We cannot do this right now with just EINVAL, since combinations
> of errors like 'using user_ip6 field but attaching to v4 hook'
> are too high to express as errno.
> This would be bigger change. If you folks think it's worth it
> we can go with this approach or if you think 6 new prog types
> is not too bad, we can leave the patch as-is.
> Comments?
> Other comments on patches are welcome.
>
> Andrey Ignatov (6):
>   bpf: Hooks for sys_bind
>   selftests/bpf: Selftest for sys_bind hooks
>   net: Introduce __inet_bind() and __inet6_bind
>   bpf: Hooks for sys_connect
>   selftests/bpf: Selftest for sys_connect hooks
>   bpf: Post-hooks for sys_bind
>
>  include/linux/bpf-cgroup.h|  68 +++-
>  include/linux/bpf_types.h |   6 +
>  include/linux/filter.h|  10 +
>  include/net/inet_common.h |   2 +
>  include/net/ipv6.h|   2 +
>  include/net/sock.h|   3 +
>  include/net/udp.h |   1 +
>  include/uapi/linux/bpf.h  |  52 ++-
>  kernel/bpf/cgroup.c   |  36 ++
>  kernel/bpf/syscall.c  |  42 ++
>  kernel/bpf/verifier.c |   6 +
>  net/core/filter.c | 479 ++-
>  net/ipv4/af_inet.c|  60 ++-
>  net/ipv4/tcp_ipv4.c   |  16 +
>  net/ipv4/udp.c|  14 +
>  net/ipv6/af_inet6.c   |  47 ++-
>  net/ipv6/tcp_ipv6.c   |  16 +
>  net/ipv6/udp.c|  20 +
>  tools/include/uapi/linux/bpf.h|  39 +-
>  tools/testing/selftests/bpf/Makefile  |   8 +-
>  tools/testing/selftests/bpf/bpf_helpers.h |   2 +
>  tools/testing/selftests/bpf/connect4_prog.c   |  45 +++
>  tools/testing/selftests/bpf/connect6_prog.c   |  61 +++
>  tools/testing/selftests/bpf/test_sock_addr.c  | 541 
> ++
>  tools/testing/selftests/bpf/test_sock_addr.sh |  57 +++
>  25 files changed, 1580 insertions(+), 53 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/connect4_prog.c
>  create mode 100644 tools/testing/selftests/bpf/connect6_prog.c
>  create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c
>  cre

Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-12 Thread महेश बंडेवार
On Fri, Jan 12, 2018 at 12:48 AM, Jiri Benc  wrote:
> On Fri, 12 Jan 2018 09:34:13 +0100, Jiri Benc wrote:
>> I don't think this works currently. When someone (does not have to be
>> you, it can be a management software running in background) sets the
>> MTU to the current value, the magic behavior is lost without any way to
>> restore it (unless I'm missing a way to restore it, see my question
>> above). So any user that depends on the magic behavior is broken anyway
>> even now.
>
> Upon further inspection, it seems that currently, slaves always follow
> master's MTU without a way to change it. Tough situation. Even
> implementing user space toggleable mtu_adj could break users in the way
> I described. But it seems to be the lesser evil, at least there would
> be a way to unbreak the scripts with one line addition.
>
> But it's absolute must to have this visible to the user space and
> changeable. Something like this:
>
> # ip a
> 123: ipvlan0:  mtu 1500 (auto) qdisc ...
> # ip l s ipvlan0 mtu 1400
> # ip a
> 123: ipvlan0:  mtu 1400 qdisc ...
> # ip l s ipvlan0 mtu auto
> # ip a
> 123: ipvlan0:  mtu 1500 (auto) qdisc ...
>
(Looks like you missed the last update I mentioned.)
Here is the approach in detail:

(a) At slave creation time - exactly as it's done currently: the slave
copies the master's MTU, and at the same time the slave's max_mtu is
set to the current MTU of the master.
(b) If the slave updates its MTU - ipvlan_change_mtu() is called, the
slave's MTU gets set, and a per-slave flag is set indicating that the
slave has changed its MTU (dissociating it from the master if the MTU
differs from the master's). If the slave's MTU is set to the same value
as the master's, the flag is cleared again, indicating that it wants to
follow the master (current behavior).
(c) When the master updates its MTU - ipvlan_adj_mtu() is called, and
every slave's max_mtu changes to the master's MTU value (clamping
applies to all slaves which are not following the master). All the
slaves which are following the master (per-slave flag) get this new
MTU. Another consequence is that a slave's flag may get cleared if the
master's MTU is reduced to the value that was set earlier for the
slave (and it will start following the master again).

The above should work, right? User space can query the MTU of the
slave device just like any other device. I was thinking about 'mtu_adj'
with some additional future extension, but for now we can live with a
flag on the slave device(s).
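
A rough sketch of (b)/(c), assuming a per-slave flag (the 'local_mtu'
field and the ipvlan_adj_mtu() hook are illustrative, not the actual
ipvlan code):

static int ipvlan_change_mtu(struct net_device *dev, int new_mtu)
{
	struct ipvl_dev *ipvlan = netdev_priv(dev);

	/* (b) Slave sets its own MTU: it follows the master only while
	 * its MTU matches the master's.
	 */
	ipvlan->local_mtu = (new_mtu != ipvlan->phy_dev->mtu);
	dev->mtu = new_mtu;
	return 0;
}

static void ipvlan_adj_mtu(struct ipvl_port *port, int new_mtu)
{
	struct ipvl_dev *ipvlan;

	/* (c) Master's MTU changed: clamp everyone's max_mtu; followers
	 * inherit the new MTU, non-followers are only clamped, and a
	 * clamped slave starts following the master again.
	 */
	list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
		ipvlan->dev->max_mtu = new_mtu;
		if (!ipvlan->local_mtu) {
			ipvlan->dev->mtu = new_mtu;
		} else if (ipvlan->dev->mtu >= new_mtu) {
			ipvlan->dev->mtu = new_mtu;
			ipvlan->local_mtu = false;
		}
	}
}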

thanks,
--mahesh..

>  Jiri


Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-11 Thread महेश बंडेवार
On Thu, Jan 11, 2018 at 8:59 AM, Mahesh Bandewar (महेश बंडेवार)
 wrote:
> On Thu, Jan 11, 2018 at 3:25 AM, Jiri Benc  wrote:
>> On Wed, 10 Jan 2018 18:09:50 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
>>> I still prefer the approach I had mentioned that uses 'mtu_adj'. In
>>> that approach you can leave those slaves which have changed their mtu
>>> to be lower than masters' but if master's mtu changes to larger value
>>> all other slaves will get updated mtu leaving behind the slaves who
>>> have opted to change their mtu on their own. Also the same thing is
>>> true when mtu get reduced at master.
>>
>> The problem with this magic behavior is, well, that it's magic. There's
>> no way to tell what happens with a given slave when the master's MTU
>> gets changed just by looking at the current configuration. There's also
>> no way to switch the magic behavior back on once the slave's MTU is
>> changed.
>>
> I guess the logic would be as simple as - if mtu_adj for a slave is
> set to 0, then it's
> following master otherwise not. By setting different mtu for a slave, you will
> set this mtu_adj a positive number which would mean it's not following master.
> So it's subjected to clamping when masters' mtu is reducing but should stay
> otherwise. Also when slave decides to follow master again, it can set the mtu
> to be same as masters' (making mtu_adj == 0) and then it would start following
> master again.
>
> Whether it's magic or not, it's the current behavior and I know several use
> cases depend on this behavior which would be broken otherwise. The
> approach I proposed keeps that going for those who depend on that while
> adds an ability to set mtu per slave for the use case mentioned in this
> patch-set too.
>
Actually, we can just have a boolean value per slave indicating whether it has
changed its mtu locally, instead of maintaining the mtu diff, which is not
needed at this moment.

>> At minimum, you'd need some kind of indication that the slave's MTU is
>> following the master. And a way to toggle this back.
>>
>> Keefe's patch is much saner, the behavior is completely deterministic.
>>
>>  Jiri


Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-11 Thread महेश बंडेवार
On Thu, Jan 11, 2018 at 3:25 AM, Jiri Benc  wrote:
> On Wed, 10 Jan 2018 18:09:50 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
>> I still prefer the approach I had mentioned that uses 'mtu_adj'. In
>> that approach you can leave those slaves which have changed their mtu
>> to be lower than masters' but if master's mtu changes to larger value
>> all other slaves will get updated mtu leaving behind the slaves who
>> have opted to change their mtu on their own. Also the same thing is
>> true when mtu get reduced at master.
>
> The problem with this magic behavior is, well, that it's magic. There's
> no way to tell what happens with a given slave when the master's MTU
> gets changed just by looking at the current configuration. There's also
> no way to switch the magic behavior back on once the slave's MTU is
> changed.
>
I guess the logic would be as simple as: if mtu_adj for a slave is set to 0,
then it's following the master, otherwise not. By setting a different mtu for
a slave, you set mtu_adj to a positive number, which means it's not following
the master. It is then subject to clamping when the master's mtu is reduced,
but should stay unchanged otherwise. Also, when a slave decides to follow the
master again, it can set its mtu to be the same as the master's (making
mtu_adj == 0) and then it would start following the master again.

Whether it's magic or not, it's the current behavior, and I know of several
use cases that depend on this behavior and would be broken otherwise. The
approach I proposed keeps that working for those who depend on it, while
adding the ability to set the mtu per slave for the use case mentioned in
this patch-set too.

> At minimum, you'd need some kind of indication that the slave's MTU is
> following the master. And a way to toggle this back.
>
> Keefe's patch is much saner, the behavior is completely deterministic.
>
>  Jiri


Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-10 Thread महेश बंडेवार
On Tue, Jan 9, 2018 at 11:21 PM,   wrote:
> From: Keefe Liu 
>
> The MTU of ipvlan interface should not bigger than the phy device, When we
> run following scripts, we will find there are some problems.
> Step1:
> ip link add link eth0 name ipv1 type ipvlan mode l2
> ip netns add net1
> ip link set dev ipv1 netns net1
> Step2:
> ip netns exec net1 ip link set dev ipv1 mtu 1501
> RTNETLINK answers: Invalid argument
> dmesg info: "ipv1: Invalid MTU 1501 requested, hw max 1500"
> Step3:
> ip link set dev eth0 mtu 1600
> ip netns exec net1 ip link set dev ipv1 mtu 1501
> RTNETLINK answers: Invalid argument
> dmesg info: "ipv1: Invalid MTU 1501 requested, hw max 1500"
> Step4:
> ip link set dev eth0 mtu 1400
> ip netns exec net1 ip link set dev ipv1 mtu 1500
> The result of Step2 is we expected, but the result of Step3 and Step4
> are not.
>
> This patch set ipvlan's MTU ranges from ETH_MIN_MTU to phy device's maximum
> MTU. When we create a new ipvlan device, its MTU will be set to the same
> with phy device if we not specify the mtu value. When we change the ipvlan
> device's MTU, ipvlan_change_mtu() will make sure the new MTU no bigger than
> the phy device's MTU.
>
> When we change the phy device's MTU, ipvlan devices attached to this
> phy device will check their MTU value. If ipvlan's MTU bigger than phy's,
> ipvlan's MTU will be set to the same with phy device, otherwise ipvlan's
> MTU remains unchanged.
>
> Signed-off-by: Keefe Liu 
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 36 
>  1 file changed, 28 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c 
> b/drivers/net/ipvlan/ipvlan_main.c
> index 30cb803..2ba071a 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -34,11 +34,6 @@ struct ipvlan_netns {
> .l3mdev_l3_rcv = ipvlan_l3_rcv,
>  };
>
> -static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device 
> *dev)
> -{
> -   ipvlan->dev->mtu = dev->mtu;
> -}
> -
>  static int ipvlan_register_nf_hook(struct net *net)
>  {
> struct ipvlan_netns *vnet = net_generic(net, ipvlan_netid);
> @@ -380,12 +375,24 @@ static int ipvlan_get_iflink(const struct net_device 
> *dev)
> return ipvlan->phy_dev->ifindex;
>  }
>
> +static int ipvlan_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +   struct ipvl_dev *ipvlan = netdev_priv(dev);
> +
> +   if (ipvlan->phy_dev->mtu < new_mtu)
> +   return -EINVAL;
> +
> +   dev->mtu = new_mtu;
> +   return 0;
> +}
> +
>  static const struct net_device_ops ipvlan_netdev_ops = {
> .ndo_init   = ipvlan_init,
> .ndo_uninit = ipvlan_uninit,
> .ndo_open   = ipvlan_open,
> .ndo_stop   = ipvlan_stop,
> .ndo_start_xmit = ipvlan_start_xmit,
> +   .ndo_change_mtu = ipvlan_change_mtu,
> .ndo_fix_features   = ipvlan_fix_features,
> .ndo_change_rx_flags= ipvlan_change_rx_flags,
> .ndo_set_rx_mode= ipvlan_set_multicast_mac_filter,
> @@ -581,10 +588,18 @@ int ipvlan_link_new(struct net *src_net, struct 
> net_device *dev,
> }
> }
>
> +   if (!tb[IFLA_MTU])
> +   dev->mtu = phy_dev->mtu;
> +   else if (dev->mtu > phy_dev->mtu)
> +   return -EINVAL;
> +
> +   /* MTU range: 68 - phy_dev->max_mtu */
> +   dev->min_mtu = ETH_MIN_MTU;
> +   dev->max_mtu = phy_dev->max_mtu;
> +
> ipvlan->phy_dev = phy_dev;
> ipvlan->dev = dev;
> ipvlan->sfeatures = IPVLAN_FEATURES;
> -   ipvlan_adjust_mtu(ipvlan, phy_dev);
> INIT_LIST_HEAD(&ipvlan->addrs);
>
> /* TODO Probably put random address here to be presented to the
> @@ -680,6 +695,8 @@ void ipvlan_link_setup(struct net_device *dev)
>  {
> ether_setup(dev);
>
> +   dev->min_mtu = 0;
> +   dev->max_mtu = ETH_MAX_MTU;
> dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING);
> dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
> dev->netdev_ops = &ipvlan_netdev_ops;
> @@ -775,8 +792,11 @@ static int ipvlan_device_event(struct notifier_block 
> *unused,
> break;
>
> case NETDEV_CHANGEMTU:
> -   list_for_each_entry(ipvlan, &port->ipvlans, pnode)
> -   ipvlan_adjust_mtu(ipvlan, dev);
> +   list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
> +   if (ipvlan->dev->mtu <= dev->mtu)
> +   continue;
> +   dev_set_mtu(ipvlan->dev, dev->mtu);
> +   }
Please look at the suggestion that I gave you in the earlier
thread. This approach will cause a regression. Currently, if the master
has its MTU set to 1500, all its slaves will get a 1500 MTU. If master

Re: [PATCH] ipvlan: fix ipvlan MTU limits

2018-01-10 Thread महेश बंडेवार
On Tue, Jan 9, 2018 at 7:12 PM, liuqifa  wrote:
>
>> On Mon, Jan 8, 2018 at 10:48 PM,   wrote:
>> > From: Keefe Liu 
>> >
>> > The MTU of ipvlan interface should not bigger than the phy device,
>> > When we run following scripts, we will find there are some problems.
>> > Step1:
>> > ip link add link eth0 name ipv1 type ipvlan mode l2
>> > ip netns add net1
>> > ip link set dev ipv1 netns net1
>> > Step2:
>> > ip netns exec net1 ip link set dev ipv1 mtu 1501
>> > RTNETLINK answers: Invalid argument
>> > dmesg info: "ipv1: Invalid MTU 1501 requested, hw max 1500"
>> > Step3:
>> > ip link set dev eth0 mtu 1600
>> > ip netns exec net1 ip link set dev ipv1 mtu 1501
>> > RTNETLINK answers: Invalid argument
>> > dmesg info: "ipv1: Invalid MTU 1501 requested, hw max 1500"
>> > Step4:
>> > ip link set dev eth0 mtu 1400
>> > ip netns exec net1 ip link set dev ipv1 mtu 1500 The result of
>> > Step2 is we expected, but the result of Step3 and Step4 are not.
>> >
>> > This patch set ipvlan's maximum MTU to ETH_MAX_MTU, and when we
>> change
>> > the ipvlan device's MTU, ipvlan_change_mtu() will make sure the new
>> > MTU no larger than the phy device's MTU.
>> >
>> > Signed-off-by: Keefe Liu 
>> > ---
>> >  drivers/net/ipvlan/ipvlan_main.c | 14 ++
>> >  1 file changed, 14 insertions(+)
>> >
>> > diff --git a/drivers/net/ipvlan/ipvlan_main.c
>> > b/drivers/net/ipvlan/ipvlan_main.c
>> > index 30cb803..84c007d 100644
>> > --- a/drivers/net/ipvlan/ipvlan_main.c
>> > +++ b/drivers/net/ipvlan/ipvlan_main.c
>> > @@ -380,12 +380,24 @@ static int ipvlan_get_iflink(const struct net_device
>> *dev)
>> > return ipvlan->phy_dev->ifindex;  }
>> >
>> > +static int ipvlan_change_mtu(struct net_device *dev, int new_mtu) {
>> > +   struct ipvl_dev *ipvlan = netdev_priv(dev);
>> > +
>> > +   if (ipvlan->phy_dev->mtu < new_mtu)
>> > +   return -EINVAL;
>> > +
>> > +   dev->mtu = new_mtu;
>> > +   return 0;
>> > +}
>> > +
>> >  static const struct net_device_ops ipvlan_netdev_ops = {
>> > .ndo_init   = ipvlan_init,
>> > .ndo_uninit = ipvlan_uninit,
>> > .ndo_open   = ipvlan_open,
>> > .ndo_stop   = ipvlan_stop,
>> > .ndo_start_xmit = ipvlan_start_xmit,
>> > +   .ndo_change_mtu = ipvlan_change_mtu,
>> > .ndo_fix_features   = ipvlan_fix_features,
>> > .ndo_change_rx_flags= ipvlan_change_rx_flags,
>> > .ndo_set_rx_mode= ipvlan_set_multicast_mac_filter,
>> > @@ -680,6 +692,8 @@ void ipvlan_link_setup(struct net_device *dev)  {
>> > ether_setup(dev);
>> >
>> > +   dev->min_mtu = 0;
>> should be ETH_MIN_MTU since we expect the underlying device to be
>> ETH_ARPHDR and IPvlan deals with IPv4/IPv6.
>>
> [liuqifa] Yes, I agree with you. But I have another question, I found in 
> commit 9157208,  macvlan's min_mtu has been changed from ETH_MIN_MTU to 0, 
> I'd like to know
> what's the difference between macvlan and ipvlan.

IPvlan uses the L3 address for its demux logic, and the minimum L3 MTU is
set to ETH_MIN_MTU (68) to cover IPv4 packets with IP options, which nobody
really uses now. Practically, for IPvlan to operate, the headers need to be
ether_hdr (14) + max(IPv4 hdr, IPv6 hdr) (40) = 54 bytes. So the MTU, which
covers the headers + payload, needs to be more than 54. The same principle
applies to macvlan, except that its demux logic uses L2 instead of L3 (as in
the case of IPvlan). I don't know the motive behind setting it to 0 in
macvlan. The header arithmetic is sketched below.
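Just to spell out the arithmetic (the IPVLAN_MIN_HDR name below is purely
illustrative, not an actual driver constant):

#include <linux/if_ether.h>     /* ETH_HLEN == 14 */
#include <linux/ip.h>           /* sizeof(struct iphdr) == 20 (no options) */
#include <linux/ipv6.h>         /* sizeof(struct ipv6hdr) == 40 */

/* smallest header space ipvlan's demux can work with: 14 + 40 = 54 */
#define IPVLAN_MIN_HDR  (ETH_HLEN + sizeof(struct ipv6hdr))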

>> > +   dev->max_mtu = ETH_MAX_MTU;
>> > dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING);
>> > dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> > dev->netdev_ops = &ipvlan_netdev_ops;
>> > --
>> > 1.8.3.1
>> >
>> >
>> These changes are not sufficient if you want to have different per-slave mtu
>> settings. One can always change the MTU of the master device and all  per-
>> slave settings will get wiped. I don't think that's a desired outcome.
> [liuqifa] I agree with you, I will reprogram the code to realize this goal.
I suggest you use the per-slave 'mtu_adj' field that I had in the code for
the exact same reasons that you are implementing this. I never got around to
completing it (it was deleted in later updates), but you can re-add that
field. So when ndo_change_mtu is invoked for a slave, you recalculate the
'mtu_adj' field, and when ndo_change_mtu is called for the master, you
adjust this field for all of the slaves. This will make ndo_change_mtu work
on both the slave as well as the master without causing issues.


Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-09 Thread महेश बंडेवार
On Tue, Jan 9, 2018 at 2:28 PM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
>> On Mon, Jan 8, 2018 at 10:36 AM, Serge E. Hallyn  wrote:
>> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
>> >> On Mon, Jan 8, 2018 at 10:11 AM, Serge E. Hallyn  wrote:
>> >> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
>> >> >> On Mon, Jan 8, 2018 at 7:47 AM, Serge E. Hallyn  
>> >> >> wrote:
>> >> >> > Quoting James Morris (james.l.mor...@oracle.com):
>> >> >> >> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
>> >> >> >> I meant in terms of "marking" a user ns as "controlled" type -- it's
>> >> >> >> unnecessary jargon from an end user point of view.
>> >> >> >
>> >> >> > Ah, yes, that was my point in
>> >> >> >
>> >> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
>> >> >> > and
>> >> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html
>> >> >> >
>> >> >> >> This may happen internally but don't make it a special case with a
>> >> >> >> different name and don't bother users with internal concepts: simply
>> >> >> >> implement capability whitelists with the default having equivalent
>> >> >
>> >> > So the challenge is to have unprivileged users be contained, while
>> >> > allowing trusted workloads in containers created by a root user to
>> >> > bypass the restriction.
>> >> >
>> >> > Now, the current proposal actually doesn't support a root user starting
>> >> > an application that it doesn't quite trust in such a way that it *is*
>> >> > subject to the whitelist.
>> >>
>> >> Well, this is not hard since root process can spawn another process
>> >> and loose privileges before creating user-ns to be controlled by the
>> >> whitelist.
>> >
>> > It would have to drop cap_sys_admin for the container to be marked as
>> > "controlled", which may prevent the container runtime from properly 
>> > starting
>> > the container.
>> >
>> Yes, but that's a conflict of trusted operations (that requires
>> SYS_ADMIN) and untrusted processes it may spawn.
>
> Not sure I understand what you're saying, but
>
> I guess that in any case the task which is doing unshare(CLONE_NEWNS)
> can drop cap_sys_admin first.  Though that is harder if using clone,
> and it is awkward because it's not the container manager, but the user,
> who will judge whether the container workload should be restricted.
> So the container driver will add a flag like "run-controlled", and
> the driver will convert that to dropping a capability; which again
> is weird.  It would seem nicer to introduce a userns flag, 'caps-controlled'
> For an unprivileged userns, it is always set to 1, and root cannot
> change it.  For a root-created userns, it stays 0, but root can set it
> to 1 (using /proc file?).  In this way a either container runtime or just an
> admin script can say "no wait I want this container to still be controlled".
>
> Or we could instead add a second sysctl to decide whether all or only
> 'controlled' user namespaces should be controlled.  That's not pretty though.
>
Yes, I like your idea of a flag to clone() which will force the
user-ns to be controlled. This will have an effect only for the root user;
any other user specifying it is effectively a NOP since those namespaces
will be controlled with or without that flag. But this is still an
enhancement to the current patch-set and I don't mind doing it as a
follow-up after this patch-series.

At this moment James has asked for Eric's input, which I believe
hasn't been recorded.


Re: [PATCH] ipvlan: fix ipvlan MTU limits

2018-01-09 Thread महेश बंडेवार
On Mon, Jan 8, 2018 at 10:48 PM,   wrote:
> From: Keefe Liu 
>
> The MTU of ipvlan interface should not bigger than the phy device, When we
> run following scripts, we will find there are some problems.
> Step1:
> ip link add link eth0 name ipv1 type ipvlan mode l2
> ip netns add net1
> ip link set dev ipv1 netns net1
> Step2:
> ip netns exec net1 ip link set dev ipv1 mtu 1501
> RTNETLINK answers: Invalid argument
> dmesg info: "ipv1: Invalid MTU 1501 requested, hw max 1500"
> Step3:
> ip link set dev eth0 mtu 1600
> ip netns exec net1 ip link set dev ipv1 mtu 1501
> RTNETLINK answers: Invalid argument
> dmesg info: "ipv1: Invalid MTU 1501 requested, hw max 1500"
> Step4:
> ip link set dev eth0 mtu 1400
> ip netns exec net1 ip link set dev ipv1 mtu 1500
> The result of Step2 is we expected, but the result of Step3 and Step4
> are not.
>
> This patch set ipvlan's maximum MTU to ETH_MAX_MTU, and when we change
> the ipvlan device's MTU, ipvlan_change_mtu() will make sure the new MTU
> no larger than the phy device's MTU.
>
> Signed-off-by: Keefe Liu 
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c 
> b/drivers/net/ipvlan/ipvlan_main.c
> index 30cb803..84c007d 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -380,12 +380,24 @@ static int ipvlan_get_iflink(const struct net_device 
> *dev)
> return ipvlan->phy_dev->ifindex;
>  }
>
> +static int ipvlan_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +   struct ipvl_dev *ipvlan = netdev_priv(dev);
> +
> +   if (ipvlan->phy_dev->mtu < new_mtu)
> +   return -EINVAL;
> +
> +   dev->mtu = new_mtu;
> +   return 0;
> +}
> +
>  static const struct net_device_ops ipvlan_netdev_ops = {
> .ndo_init   = ipvlan_init,
> .ndo_uninit = ipvlan_uninit,
> .ndo_open   = ipvlan_open,
> .ndo_stop   = ipvlan_stop,
> .ndo_start_xmit = ipvlan_start_xmit,
> +   .ndo_change_mtu = ipvlan_change_mtu,
> .ndo_fix_features   = ipvlan_fix_features,
> .ndo_change_rx_flags= ipvlan_change_rx_flags,
> .ndo_set_rx_mode= ipvlan_set_multicast_mac_filter,
> @@ -680,6 +692,8 @@ void ipvlan_link_setup(struct net_device *dev)
>  {
> ether_setup(dev);
>
> +   dev->min_mtu = 0;
should be ETH_MIN_MTU since we expect the underlying device to be
ETH_ARPHDR and IPvlan deals with IPv4/IPv6.

> +   dev->max_mtu = ETH_MAX_MTU;
> dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING);
> dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
> dev->netdev_ops = &ipvlan_netdev_ops;
> --
> 1.8.3.1
>
>
These changes are not sufficient if you want to have different
per-slave mtu settings. One can always change the MTU of the master
device and all  per-slave settings will get wiped. I don't think
that's a desired outcome.


Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-08 Thread महेश बंडेवार
On Mon, Jan 8, 2018 at 10:36 AM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
>> On Mon, Jan 8, 2018 at 10:11 AM, Serge E. Hallyn  wrote:
>> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
>> >> On Mon, Jan 8, 2018 at 7:47 AM, Serge E. Hallyn  wrote:
>> >> > Quoting James Morris (james.l.mor...@oracle.com):
>> >> >> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
>> >> >> I meant in terms of "marking" a user ns as "controlled" type -- it's
>> >> >> unnecessary jargon from an end user point of view.
>> >> >
>> >> > Ah, yes, that was my point in
>> >> >
>> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
>> >> > and
>> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html
>> >> >
>> >> >> This may happen internally but don't make it a special case with a
>> >> >> different name and don't bother users with internal concepts: simply
>> >> >> implement capability whitelists with the default having equivalent
>> >
>> > So the challenge is to have unprivileged users be contained, while
>> > allowing trusted workloads in containers created by a root user to
>> > bypass the restriction.
>> >
>> > Now, the current proposal actually doesn't support a root user starting
>> > an application that it doesn't quite trust in such a way that it *is*
>> > subject to the whitelist.
>>
>> Well, this is not hard since root process can spawn another process
>> and loose privileges before creating user-ns to be controlled by the
>> whitelist.
>
> It would have to drop cap_sys_admin for the container to be marked as
> "controlled", which may prevent the container runtime from properly starting
> the container.
>
Yes, but that's a conflict between trusted operations (which require
SYS_ADMIN) and the untrusted processes it may spawn.

>> You need an ability to preserve the creation of user-namespaces that
>> exhibit 'the uncontrolled behavior' and only trusted/privileged (root)
>> user should have it which is maintained here.
>>
>> > Which is unfortunate.  But apart from using
>> > ptags or a cgroup, I can't think of a good way to get us everything we
>> > want:
>> >
>> > 1. unprivileged users always restricted
>> > 2. existing unprivileged containers become restricted when whitelist
>> > is enabled
>> > 3. privileged users are able to create containers which are not restricted
>>
>> all this is achieved by the patch-set without any changes to the
>> application with the above knob.
>>
>> > 4. privileged users are able to create containers which *are* restricted
>> >
>> With this patch-set; the root user process can fork another process
>> with less privileges before creating a user-ns if the exec-ed process
>> cannot be trusted. So there is a way with little modification as
>> opposed to nothing available at this moment for this scenario.


Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-08 Thread महेश बंडेवार
On Mon, Jan 8, 2018 at 10:11 AM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
>> On Mon, Jan 8, 2018 at 7:47 AM, Serge E. Hallyn  wrote:
>> > Quoting James Morris (james.l.mor...@oracle.com):
>> >> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
>> >> I meant in terms of "marking" a user ns as "controlled" type -- it's
>> >> unnecessary jargon from an end user point of view.
>> >
>> > Ah, yes, that was my point in
>> >
>> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
>> > and
>> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html
>> >
>> >> This may happen internally but don't make it a special case with a
>> >> different name and don't bother users with internal concepts: simply
>> >> implement capability whitelists with the default having equivalent
>
> So the challenge is to have unprivileged users be contained, while
> allowing trusted workloads in containers created by a root user to
> bypass the restriction.
>
> Now, the current proposal actually doesn't support a root user starting
> an application that it doesn't quite trust in such a way that it *is*
> subject to the whitelist.

Well, this is not hard, since the root process can spawn another process
and lose privileges before creating the user-ns, so that it is controlled by
the whitelist.
You need the ability to preserve the creation of user-namespaces that
exhibit 'the uncontrolled behavior', and only a trusted/privileged (root)
user should have it, which is maintained here.

> Which is unfortunate.  But apart from using
> ptags or a cgroup, I can't think of a good way to get us everything we
> want:
>
> 1. unprivileged users always restricted
> 2. existing unprivileged containers become restricted when whitelist
> is enabled
> 3. privileged users are able to create containers which are not restricted

All of this is achieved by the patch-set, without any changes to the
application, via the above knob.

> 4. privileged users are able to create containers which *are* restricted
>
With this patch-set, the root user's process can fork another process with
fewer privileges before creating a user-ns, if the exec'd process cannot be
trusted. So there is a way, with a little modification, as opposed to nothing
being available at this moment for this scenario. A minimal sketch of this is
below.
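Something like this (a minimal userspace sketch using libcap, assuming - per
the patch description - that a creator without CAP_SYS_ADMIN ends up with a
"controlled" user-ns; build with -lcap):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/capability.h>

int main(void)
{
        cap_value_t drop[] = { CAP_SYS_ADMIN };
        cap_t caps = cap_get_proc();

        if (!caps)
                exit(1);

        /* Drop CAP_SYS_ADMIN from the effective and permitted sets, so
         * the user-ns created below would be marked "controlled" under
         * this patch-set.
         */
        if (cap_set_flag(caps, CAP_EFFECTIVE, 1, drop, CAP_CLEAR) ||
            cap_set_flag(caps, CAP_PERMITTED, 1, drop, CAP_CLEAR) ||
            cap_set_proc(caps)) {
                perror("cap_set_proc");
                exit(1);
        }
        cap_free(caps);

        if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
                perror("unshare");
                exit(1);
        }

        printf("user-ns created without CAP_SYS_ADMIN in the creator\n");
        /* From here on, capabilities missing from the global whitelist
         * would be denied to this process tree.
         */
        return 0;
}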


Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-08 Thread महेश बंडेवार
On Mon, Jan 8, 2018 at 7:47 AM, Serge E. Hallyn  wrote:
> Quoting James Morris (james.l.mor...@oracle.com):
>> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
>>
>> > > Also, why do we need the concept of a controlled user-ns at all, if the
>> > > default whitelist maintains existing behavior?
>> >
>> > In past discussions two uses have been brought up:
>> >
>> > 1. if an 0-day is discovered which is exacerbated by a specific
>> > privilege in user namespaces, that privilege could be turned off until a
>> > reboot with a fixed kernel is scheduled, without fully disabling all
>> > containers.
>> >
>> > 2. some systems may be specifically designed to run software which
>> > only requires a few capabilities in a userns.  In that case all others
>> > could be disabled.
>> >
>>
>> I meant in terms of "marking" a user ns as "controlled" type -- it's
>> unnecessary jargon from an end user point of view.
>
> Ah, yes, that was my point in
>
> http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
> and
> http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html
>
>> This may happen internally but don't make it a special case with a
>> different name and don't bother users with internal concepts: simply
>> implement capability whitelists with the default having equivalent
>> behavior of everything allowed.  Then, document the semantics of the
>> whitelist in terms of inheritance etc., as a feature of user namespaces,
>> not as a "type" of user namespace.
>
> The problem with making them inheritable is that an adversarial user
> can just create a user namespace at boot that sits and waits for an
> 0day to be published, then log in and attach to that namespace later,
> since it has already inherited the open whitelist.
>
> It feels like there must be some other approach that doesn't feel as...
> band-aid-y as this does, but I'm not sure what.

We had a long discussion thread about this approach in the past and many of
these points have been discussed there. Serge is on point regarding both
issues (the 0-day issue as well as the sandbox environment). At this moment
we are exposed to those threats, and apart from this patch-set I'm not aware
of anything that handles them. Of course there are other alternatives that
block user-ns creation altogether, but blocking user-ns is not a real
solution that works in every use-case. I'm open to other ideas (if any) that
do not block creation of user-ns, but the lack of those will keep the 0-day
issue lingering, especially for environments where user-ns creation is used
heavily.

The 'controlled user-ns' jargon stays within kernel-space and is not exposed
to users (we don't have any API to do that), but I used those terms to
explain within the kernel community what this patch-set is trying to do. The
user application does not need to change, nor does it need to know about any
of these changes as such. However, this additional knob gives the admin the
ability to control the behavior in those two circumstances. The default
behavior chosen in this patch-set is such that it doesn't cause a regression
for anyone, whatever their use case is, but now the admin can set whatever
default behavior they wish in the boot scripts to suit their needs.

Thanks,
--mahesh..


Re: bonding: Completion of error handling around bond_update_slave_arr()

2018-01-04 Thread महेश बंडेवार
On Thu, Jan 4, 2018 at 12:19 AM, SF Markus Elfring
 wrote:
>> If you see 8 out of 9 call sites in this file ignore the return value.
>
> How do you think about to fix error detection and corresponding
> exception handling then?
>
If I understand your question correctly - not having memory is not a
correctable error, and hence there are consequences. In this case the
slave_arr is not going to be rebuilt, and this might mean losing packets (if
the interface was dropped) or not using the interface to send packets if one
was added (a very unlikely case). From the host's perspective, however, this
might be the last thing you want to worry about when there is no memory
left.

> Regards,
> Markus


Re: [PATCHv4 0/2] capability controlled user-namespaces

2018-01-03 Thread महेश बंडेवार
On Wed, Jan 3, 2018 at 8:44 AM, Eric W. Biederman  wrote:
> Mahesh Bandewar  writes:
>
>> From: Mahesh Bandewar 
>>
>> TL;DR version
>> -
>> Creating a sandbox environment with namespaces is challenging
>> considering what these sandboxed processes can engage into. e.g.
>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>> Current form of user-namespaces, however, if changed a bit can allow
>> us to create a sandbox environment without locking down user-
>> namespaces.
>
> In other conversations it appears it has been pointed out that user
> namespaces are not necessarily safe under no_new_privs.  In theory
> user namespaces should be safe but in practice not so much.
>
> So let me ask.  Would your concerns be addressed if we simply made
> creation and joining of user namespaces impossible in a no_new_privs
> sandbox?
>
Isn't this another form of locking down user-ns, similar to setting the
per-user-ns sysctl max_userns = 0?

Having said that, not allowing processes to create and/or attach to
user-namespaces is going to be problematic and possibly a regression. This
(current) patchset doesn't do that. It allows users to create user-ns's of
any depth and number permitted by the per-ns max_userns sysctl. However, one
can decide what to take off and what to leave in terms of capabilities for
the sandbox environment.
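(For reference, the "lock it down completely" alternative being contrasted
here is the stock per-ns limit - user.max_user_namespaces in mainline, the
max_userns knob referred to above - e.g.:

  # heavy-handed alternative: forbid creation of new user namespaces
  sysctl -w user.max_user_namespaces=0

which blocks trusted and untrusted users alike, and is exactly what this
patch-set tries to avoid.)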

--mahesh..

> Eric
>
[...]


Re: bonding: Delete an error message for a failed memory allocation in bond_update_slave_arr()

2018-01-03 Thread महेश बंडेवार
On Wed, Jan 3, 2018 at 12:45 AM, SF Markus Elfring
 wrote:
>>> Omit an extra message for a memory allocation failure in this function.
>>>
>>> This issue was detected by using the Coccinelle software.
>>>
>> What is the issue with this message?
>
> * Is it redundant?
>
> * Would a Linux allocation failure report be already sufficient here?
>
If you look, 8 out of 9 call sites in this file ignore the return value.
This message in the log could give a clue when debugging. Unless it's
spamming the log, it's not harmful, or is it?

> Regards,
> Markus


Re: [PATCH] bonding: Delete an error message for a failed memory allocation in bond_update_slave_arr()

2018-01-02 Thread महेश बंडेवार
On Mon, Jan 1, 2018 at 8:07 AM, SF Markus Elfring
 wrote:
> From: Markus Elfring 
> Date: Mon, 1 Jan 2018 17:00:04 +0100
>
> Omit an extra message for a memory allocation failure in this function.
>
> This issue was detected by using the Coccinelle software.
>
What is the issue with this message?

> Signed-off-by: Markus Elfring 
> ---
>  drivers/net/bonding/bond_main.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index c669554d70bb..a96e0c9cc4bf 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3910,7 +3910,6 @@ int bond_update_slave_arr(struct bonding *bond, struct 
> slave *skipslave)
>   GFP_KERNEL);
> if (!new_arr) {
> ret = -ENOMEM;
> -   pr_err("Failed to build slave-array.\n");
> goto out;
> }
> if (BOND_MODE(bond) == BOND_MODE_8023AD) {
> --
> 2.15.1
>


Re: [PATCHv3 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2018-01-02 Thread महेश बंडेवार
On Sat, Dec 30, 2017 at 12:50 AM, Michael Kerrisk (man-pages)
 wrote:
> Hello Mahesh,
>
> On 12/05/2017 11:31 PM, Mahesh Bandewar wrote:
>> From: Mahesh Bandewar 
>>
>> Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
>> takes input as capability mask expressed as two comma separated hex
>> u32 words. The mask, however, is stored in kernel as kernel_cap_t type.
>
> Just by the way, why is it not expressed as a 64 bit value? (The answer
> to that question should I think be part of this commit message.)
>
I think I mentioned this in the earlier threads, but it's as simple as using
the existing kernel API that deals with larger bit masks/arrays. The
capability mask is an array of u32 and one cannot guarantee that
kernel_cap_t is only 64 bits, so expressing it as u32 hex values makes
logical sense. Yes, one could simplify this even further and make this
sysctl accept string values. But bringing possibly buggy string manipulation
into the kernel is far less desirable than an already used API that is
designed specifically for this purpose. Also, I'm not expecting this sysctl
to be changed very often, so robustness is preferred over simplicity if that
simplicity is accompanied by buggy behavior.

I'll add similar text in the commit log.
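For reference, the in-kernel type this maps onto looks roughly like this (as
of the kernels this series targets), which is why two u32 words are the
natural external format:

#define _KERNEL_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3        /* == 2 today */

typedef struct kernel_cap_struct {
        __u32 cap[_KERNEL_CAPABILITY_U32S];
} kernel_cap_t;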

>> Any capabilities that are not part of this mask will be controlled and
>> will not be allowed to processes in controlled user-ns.
>>
>> Acked-by: Serge Hallyn 
>> Signed-off-by: Mahesh Bandewar 
>> ---
>> v3:
>>   Added couple of comments as requested by Serge Hallyn
>> v2:
>>   Rebase
>> v1:
>>   Initial submission
>>
>>  Documentation/sysctl/kernel.txt | 21 ++
>>  include/linux/capability.h  |  3 +++
>>  kernel/capability.c | 47 
>> +
>>  kernel/sysctl.c |  5 +
>>  4 files changed, 76 insertions(+)
>>
>> diff --git a/Documentation/sysctl/kernel.txt 
>> b/Documentation/sysctl/kernel.txt
>> index 694968c7523c..a1d39dbae847 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -25,6 +25,7 @@ show up in /proc/sys/kernel:
>>  - bootloader_version  [ X86 only ]
>>  - callhome[ S390 only ]
>>  - cap_last_cap
>> +- controlled_userns_caps_whitelist
>>  - core_pattern
>>  - core_pipe_limit
>>  - core_uses_pid
>> @@ -187,6 +188,26 @@ CAP_LAST_CAP from the kernel.
>>
>>  ==
>>
>> +controlled_userns_caps_whitelist
>> +
>> +Capability mask that is whitelisted for "controlled" user namespaces.
>
> How is a user-ns marked as "controlled"? Please clarify this.
>
Hmm, it's mentioned in this same paragraph (at the end). Looks like you missed it!

>> +Any capability that is missing from this mask will not be allowed to
>> +any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
>> +is not part of this mask, then processes running inside any controlled
>> +userns's will not be allowed to perform action that needs CAP_NET_RAW
>> +capability. However, processes that are attached to a parent user-ns
>> +hierarchy that is *not* controlled and has CAP_NET_RAW can continue
>> +performing those actions. User-namespaces are marked "controlled" at
>> +the time of their creation based on the capabilities of the creator.
>> +A process that does not have CAP_SYS_ADMIN will create user-namespaces
>> +that are controlled.
>> +
>> +The value is expressed as two comma separated hex words (u32). This
>> +sysctl is avaialble in init-ns and users with CAP_SYS_ADMIN in init-ns
>> +are allowed to make changes.
>
> Could you add here a shell session that demonstrates the use of these
> interfaces and how they allow/disallow capabilities.
>
OK, will add something to address this.
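Roughly along these lines (the word order and mask values here are only an
illustration - the format is the two comma-separated hex u32 words described
in the Documentation patch, and the full-mask value depends on cap_last_cap):

  # read the current whitelist
  sysctl kernel.controlled_userns_caps_whitelist

  # e.g. drop CAP_NET_RAW (bit 13) from the low word of a full 38-cap mask,
  # assuming the format is <high-u32>,<low-u32>
  sysctl -w kernel.controlled_userns_caps_whitelist=3f,ffffdfff

  # a process in a *controlled* user-ns (e.g. created via 'unshare -Ur' by a
  # user without CAP_SYS_ADMIN) should now fail to open a SOCK_RAW socket,
  # while uncontrolled namespaces are unaffected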

> Is there a way that a process can see whether it is a controlled user-ns
> vs an uncontrolled user-ns? I think it would be good to explain in this
> doc patch.
>
> Thanks,
>
> Michael
>
>> +==
>> +
>>  core_pattern:
>>
>>  core_pattern is used to specify a core dumpfile pattern name.
>> diff --git a/include/linux/capability.h b/include/linux/capability.h
>> index f640dcbc880c..7d79a4689625 100644
>> --- a/include/linux/capability.h
>> +++ b/include/linux/capability.h
>> @@ -14,6 +14,7 @@
>>  #define _LINUX_CAPABILITY_H
>>
>>  #include 
>> +#include 
>>
>>
>>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>> @@ -248,6 +249,8 @@ extern bool ptracer_capable(struct task_struct *tsk, 
>> struct user_namespace *ns);
>>
>>  /* audit system wants to get cap info from files as well */
>>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
>> cpu_vfs_cap_data *cpu_caps);
>> +int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>> +  void __user *buff, size_t *lenp, loff_t 
>> *ppos);
>>
>>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
>> size);
>>
>> diff --g

Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-02 Thread महेश बंडेवार
Hello Michael,

I really don't want to turn this into a how-to-hack guide, but I do see a
few points in your argument where I can make the case clearer. Please see
the comments inline.

On Sat, Dec 30, 2017 at 12:50 AM, Michael Kerrisk (man-pages)
 wrote:
> Hello Mahesh,
>
> On 12/28/2017 01:45 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
>> On Wed, Dec 27, 2017 at 12:23 PM, Michael Kerrisk (man-pages)
>>  wrote:
>>> Hello Mahesh,
>>>
>>> On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
>>>  wrote:
>>>> Hello James,
>>>>
>>>> Seems like I missed your name to be added into the review of this
>>>> patch series. Would you be willing be pull this into the security
>>>> tree? Serge Hallyn has already ACKed it.
>>>
>>> We seem to have no formal documentation/specification of this feature.
>>> I think that should be written up before this patch goes into
>>> mainline...
>>>
>> absolutely. I have added enough information into the Documentation dir
>> relevant to this feature (please look at the  individual patches),
>> that could be used. I could help if needed.
>
> Yes, but I think that the documentation is rather incomplete.
> I'll also reply to the relevant Documentation thread.
>
> See also some comments below about this commit message, which
> should make things *much* easier for the reader.
>
>>>> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar  
>>>> wrote:
>>>>> From: Mahesh Bandewar 
>>>>>
>>>>> TL;DR version
>>>>> -
>>>>> Creating a sandbox environment with namespaces is challenging
>>>>> considering what these sandboxed processes can engage into. e.g.
>>>>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>>>>> Current form of user-namespaces, however, if changed a bit can allow
>>>>> us to create a sandbox environment without locking down user-
>>>>> namespaces.
>>>>>
>>>>> Detailed version
>>>>> 
>>>>>
>>>>> Problem
>>>>> ---
>>>>> User-namespaces in the current form have increased the attack surface as
>>>>> any process can acquire capabilities which are not available to them (by
>>>>> default) by performing combination of clone()/unshare()/setns() syscalls.
>>>>>
>>>>> #define _GNU_SOURCE
>>>>> #include 
>>>>> #include 
>>>>> #include 
>>>>>
>>>>> int main(int ac, char **av)
>>>>> {
>>>>> int sock = -1;
>>>>>
>>>>> printf("Attempting to open RAW socket before unshare()...\n");
>>>>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>>>> if (sock < 0) {
>>>>> perror("socket() SOCK_RAW failed: ");
>>>>> } else {
>>>>> printf("Successfully opened RAW-Sock before unshare().\n");
>>>>> close(sock);
>>>>> sock = -1;
>>>>> }
>>>>>
>>>>> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>>>>> perror("unshare() failed: ");
>>>>> return 1;
>>>>> }
>>>>>
>>>>> printf("Attempting to open RAW socket after unshare()...\n");
>>>>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>>>> if (sock < 0) {
>>>>> perror("socket() SOCK_RAW failed: ");
>>>>> } else {
>>>>> printf("Successfully opened RAW-Sock after unshare().\n");
>>>>> close(sock);
>>>>> sock = -1;
>>>>> }
>>>>>
>>>>> return 0;
>>>>> }
>>>>>
>>>>> The above example shows how easy it is to acquire NET_RAW capabilities
>>>>> and once acquired, these processes could take benefit of above mentioned
>>>>> or similar issues discovered/undiscovered with malicious intent.
>
> But you do not actually describe what the problem is. I think
> it's not sufficient to simply refer to some CVEs.
> Your mail message/commit should clearly describe what the issue is,

Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-02 Thread महेश बंडेवार
On Sat, Dec 30, 2017 at 12:31 AM, James Morris
 wrote:
> On Wed, 27 Dec 2017, Mahesh Bandewar (महेश बंडेवार) wrote:
>
>> Hello James,
>>
>> Seems like I missed your name to be added into the review of this
>> patch series. Would you be willing be pull this into the security
>> tree? Serge Hallyn has already ACKed it.
>
> Sure!
>
Thank you James.
>
>>
>> Thanks,
>> --mahesh..
>>
>> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar  wrote:
>> > From: Mahesh Bandewar 
>> >
>> > TL;DR version
>> > -
>> > Creating a sandbox environment with namespaces is challenging
>> > considering what these sandboxed processes can engage into. e.g.
>> > CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>> > Current form of user-namespaces, however, if changed a bit can allow
>> > us to create a sandbox environment without locking down user-
>> > namespaces.
>> >
>> > Detailed version
>> > 
>> >
>> > Problem
>> > ---
>> > User-namespaces in the current form have increased the attack surface as
>> > any process can acquire capabilities which are not available to them (by
>> > default) by performing combination of clone()/unshare()/setns() syscalls.
>> >
>> > #define _GNU_SOURCE
>> > #include 
>> > #include 
>> > #include 
>> >
>> > int main(int ac, char **av)
>> > {
>> > int sock = -1;
>> >
>> > printf("Attempting to open RAW socket before unshare()...\n");
>> > sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>> > if (sock < 0) {
>> > perror("socket() SOCK_RAW failed: ");
>> > } else {
>> > printf("Successfully opened RAW-Sock before unshare().\n");
>> > close(sock);
>> > sock = -1;
>> > }
>> >
>> > if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>> > perror("unshare() failed: ");
>> > return 1;
>> > }
>> >
>> > printf("Attempting to open RAW socket after unshare()...\n");
>> > sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>> > if (sock < 0) {
>> > perror("socket() SOCK_RAW failed: ");
>> > } else {
>> > printf("Successfully opened RAW-Sock after unshare().\n");
>> > close(sock);
>> > sock = -1;
>> > }
>> >
>> > return 0;
>> > }
>> >
>> > The above example shows how easy it is to acquire NET_RAW capabilities
>> > and once acquired, these processes could take benefit of above mentioned
>> > or similar issues discovered/undiscovered with malicious intent. Note
>> > that this is just an example and the problem/solution is not limited
>> > to NET_RAW capability *only*.
>> >
>> > The easiest fix one can apply here is to lock-down user-namespaces which
>> > many of the distros do (i.e. don't allow users to create user namespaces),
>> > but unfortunately that prevents everyone from using them.
>> >
>> > Approach
>> > 
>> > Introduce a notion of 'controlled' user-namespaces. Every process on
>> > the host is allowed to create user-namespaces (governed by the limit
>> > imposed by per-ns sysctl) however, mark user-namespaces created by
>> > sandboxed processes as 'controlled'. Use this 'mark' at the time of
>> > capability check in conjunction with a global capability whitelist.
>> > If the capability is not whitelisted, processes that belong to
>> > controlled user-namespaces will not be allowed.
>> >
>> > Once a user-ns is marked as 'controlled'; all its child user-
>> > namespaces are marked as 'controlled' too.
>> >
>> > A global whitelist is list of capabilities governed by the
>> > sysctl which is available to (privileged) user in init-ns to modify
>> > while it's applicable to all controlled user-namespaces on the host.
>> >
>> > Marking user-namespaces controlled without modifying the whitelist is
>> > equivalent of the current behavior. The default value of whitelist includes
>> > all capabilities so that the compatibility is maintained. However it gives
>> > admins fine-grained ability to control various capabilities system wide
>> > without locking down user-namespaces.
>> >
>> > Please see individual patches in this series.
>> >
>> > Mahesh Bandewar (2):
>> >   capability: introduce sysctl for controlled user-ns capability whitelist
>> >   userns: control capabilities of some user namespaces
>> >
>> >  Documentation/sysctl/kernel.txt | 21 +
>> >  include/linux/capability.h  |  7 ++
>> >  include/linux/user_namespace.h  | 25 
>> >  kernel/capability.c | 52 
>> > +
>> >  kernel/sysctl.c |  5 
>> >  kernel/user_namespace.c |  4 
>> >  security/commoncap.c|  8 +++
>> >  7 files changed, 122 insertions(+)
>> >
>> > --
>> > 2.15.0.531.g2ccb3012c9-goog
>> >
>>
>
> --
> James Morris
> 


Re: [PATCHv3 0/2] capability controlled user-namespaces

2017-12-27 Thread महेश बंडेवार
On Wed, Dec 27, 2017 at 12:23 PM, Michael Kerrisk (man-pages)
 wrote:
> Hello Mahesh,
>
> On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
>  wrote:
>> Hello James,
>>
>> Seems like I missed your name to be added into the review of this
>> patch series. Would you be willing be pull this into the security
>> tree? Serge Hallyn has already ACKed it.
>
> We seem to have no formal documentation/specification of this feature.
> I think that should be written up before this patch goes into
> mainline...
>
Absolutely. I have added enough information relevant to this feature into
the Documentation dir (please look at the individual patches) that could be
used. I can help if needed.

thanks,
--mahesh..

> Cheers,
>
> Michael
>
>
>>
>> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar  wrote:
>>> From: Mahesh Bandewar 
>>>
>>> TL;DR version
>>> -
>>> Creating a sandbox environment with namespaces is challenging
>>> considering what these sandboxed processes can engage into. e.g.
>>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>>> Current form of user-namespaces, however, if changed a bit can allow
>>> us to create a sandbox environment without locking down user-
>>> namespaces.
>>>
>>> Detailed version
>>> 
>>>
>>> Problem
>>> ---
>>> User-namespaces in the current form have increased the attack surface as
>>> any process can acquire capabilities which are not available to them (by
>>> default) by performing combination of clone()/unshare()/setns() syscalls.
>>>
>>> #define _GNU_SOURCE
>>> #include 
>>> #include 
>>> #include 
>>>
>>> int main(int ac, char **av)
>>> {
>>> int sock = -1;
>>>
>>> printf("Attempting to open RAW socket before unshare()...\n");
>>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>> if (sock < 0) {
>>> perror("socket() SOCK_RAW failed: ");
>>> } else {
>>> printf("Successfully opened RAW-Sock before unshare().\n");
>>> close(sock);
>>> sock = -1;
>>> }
>>>
>>> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>>> perror("unshare() failed: ");
>>> return 1;
>>> }
>>>
>>> printf("Attempting to open RAW socket after unshare()...\n");
>>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>> if (sock < 0) {
>>> perror("socket() SOCK_RAW failed: ");
>>> } else {
>>> printf("Successfully opened RAW-Sock after unshare().\n");
>>> close(sock);
>>> sock = -1;
>>> }
>>>
>>> return 0;
>>> }
>>>
>>> The above example shows how easy it is to acquire NET_RAW capabilities
>>> and once acquired, these processes could take benefit of above mentioned
>>> or similar issues discovered/undiscovered with malicious intent. Note
>>> that this is just an example and the problem/solution is not limited
>>> to NET_RAW capability *only*.
>>>
>>> The easiest fix one can apply here is to lock-down user-namespaces which
>>> many of the distros do (i.e. don't allow users to create user namespaces),
>>> but unfortunately that prevents everyone from using them.
>>>
>>> Approach
>>> 
>>> Introduce a notion of 'controlled' user-namespaces. Every process on
>>> the host is allowed to create user-namespaces (governed by the limit
>>> imposed by per-ns sysctl) however, mark user-namespaces created by
>>> sandboxed processes as 'controlled'. Use this 'mark' at the time of
>>> capability check in conjunction with a global capability whitelist.
>>> If the capability is not whitelisted, processes that belong to
>>> controlled user-namespaces will not be allowed.
>>>
>>> Once a user-ns is marked as 'controlled'; all its child user-
>>> namespaces are marked as 'controlled' too.
>>>
>>> A global whitelist is list of capabilities governed by the
>>> sysctl which is available to (privileged) user in init-ns to modify
>>> while it's applicable to all

Re: [PATCHv3 0/2] capability controlled user-namespaces

2017-12-27 Thread महेश बंडेवार
Hello James,

Seems like I missed your name to be added into the review of this
patch series. Would you be willing be pull this into the security
tree? Serge Hallyn has already ACKed it.

Thanks,
--mahesh..

On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar  wrote:
> From: Mahesh Bandewar 
>
> TL;DR version
> -
> Creating a sandbox environment with namespaces is challenging
> considering what these sandboxed processes can engage into. e.g.
> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
> Current form of user-namespaces, however, if changed a bit can allow
> us to create a sandbox environment without locking down user-
> namespaces.
>
> Detailed version
> 
>
> Problem
> ---
> User-namespaces in the current form have increased the attack surface as
> any process can acquire capabilities which are not available to them (by
> default) by performing combination of clone()/unshare()/setns() syscalls.
>
> #define _GNU_SOURCE
> #include 
> #include 
> #include 
>
> int main(int ac, char **av)
> {
> int sock = -1;
>
> printf("Attempting to open RAW socket before unshare()...\n");
> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
> if (sock < 0) {
> perror("socket() SOCK_RAW failed: ");
> } else {
> printf("Successfully opened RAW-Sock before unshare().\n");
> close(sock);
> sock = -1;
> }
>
> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
> perror("unshare() failed: ");
> return 1;
> }
>
> printf("Attempting to open RAW socket after unshare()...\n");
> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
> if (sock < 0) {
> perror("socket() SOCK_RAW failed: ");
> } else {
> printf("Successfully opened RAW-Sock after unshare().\n");
> close(sock);
> sock = -1;
> }
>
> return 0;
> }
>
> The above example shows how easy it is to acquire NET_RAW capabilities
> and once acquired, these processes could take benefit of above mentioned
> or similar issues discovered/undiscovered with malicious intent. Note
> that this is just an example and the problem/solution is not limited
> to NET_RAW capability *only*.
>
> The easiest fix one can apply here is to lock-down user-namespaces which
> many of the distros do (i.e. don't allow users to create user namespaces),
> but unfortunately that prevents everyone from using them.
>
> Approach
> 
> Introduce a notion of 'controlled' user-namespaces. Every process on
> the host is allowed to create user-namespaces (governed by the limit
> imposed by per-ns sysctl) however, mark user-namespaces created by
> sandboxed processes as 'controlled'. Use this 'mark' at the time of
> capability check in conjunction with a global capability whitelist.
> If the capability is not whitelisted, processes that belong to
> controlled user-namespaces will not be allowed.
>
> Once a user-ns is marked as 'controlled'; all its child user-
> namespaces are marked as 'controlled' too.
>
> A global whitelist is list of capabilities governed by the
> sysctl which is available to (privileged) user in init-ns to modify
> while it's applicable to all controlled user-namespaces on the host.
>
> Marking user-namespaces controlled without modifying the whitelist is
> equivalent of the current behavior. The default value of whitelist includes
> all capabilities so that the compatibility is maintained. However it gives
> admins fine-grained ability to control various capabilities system wide
> without locking down user-namespaces.
>
> Please see individual patches in this series.
>
> Mahesh Bandewar (2):
>   capability: introduce sysctl for controlled user-ns capability whitelist
>   userns: control capabilities of some user namespaces
>
>  Documentation/sysctl/kernel.txt | 21 +
>  include/linux/capability.h  |  7 ++
>  include/linux/user_namespace.h  | 25 
>  kernel/capability.c | 52 
> +
>  kernel/sysctl.c |  5 
>  kernel/user_namespace.c |  4 
>  security/commoncap.c|  8 +++
>  7 files changed, 122 insertions(+)
>
> --
> 2.15.0.531.g2ccb3012c9-goog
>


Re: [PATCH next] ipvlan: add L2 check for packets arriving via virtual devices

2017-12-11 Thread महेश बंडेवार
On Mon, Dec 11, 2017 at 8:15 AM, David Miller  wrote:
> From: Mahesh Bandewar 
> Date: Thu,  7 Dec 2017 15:15:43 -0800
>
>> From: Mahesh Bandewar 
>>
>> Packets that don't have dest mac as the mac of the master device should
>> not be entertained by the IPvlan rx-handler. This is mostly true as the
>> packet path mostly takes care of that, except when the master device is
>> a virtual device. As demonstrated in the following case -
>  ...
>> This patch adds that missing check in the IPvlan rx-handler.
>>
>> Reported-by: Amit Sikka 
>> Signed-off-by: Mahesh Bandewar 
>
> Applied, but it's a shame that the data plane takes on this new MAC
> compare operation.
Your comment made me think a little more about this, and a discussion with
Eric kind of put things in perspective. eth_type_trans() does the right
thing and sets the packet_type correctly (when the .ndo_xmit of veth is
called). However, IPvlan is over-aggressive in packet scrubbing, and that
scrub changes the packet type. This is what causes the actual problem. It's
not clear to me why skb_scrub_packet() changes the packet type to
PACKET_HOST unconditionally, but that's another issue.

I'll send another patch to remove the excessive scrubbing in IPvlan and a
revert of this patch, so that this additional comparison (though not
expensive!) can be avoided. For context, the check added by this patch is
sketched below.
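(Sketch only - the field names follow the driver loosely and the committed
patch may differ in detail:)

/* in the ipvlan rx-handler, before demuxing to a slave */
static bool ipvlan_dest_is_for_us(const struct sk_buff *skb,
                                  const struct net_device *master)
{
        const struct ethhdr *eth = eth_hdr(skb);

        return ether_addr_equal(eth->h_dest, master->dev_addr) ||
               is_multicast_ether_addr(eth->h_dest);
}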

Thanks,
--mahesh..


Re: [PATCHv2 2/2] userns: control capabilities of some user namespaces

2017-12-05 Thread महेश बंडेवार
On Wed, Nov 29, 2017 at 9:57 AM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
>> On Tue, Nov 28, 2017 at 3:04 PM, Serge E. Hallyn  wrote:
>> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
>> > ...
>> >> >> diff --git a/security/commoncap.c b/security/commoncap.c
>> >> >> index fc46f5b85251..89103f16ac37 100644
>> >> >> --- a/security/commoncap.c
>> >> >> +++ b/security/commoncap.c
>> >> >> @@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct 
>> >> >> user_namespace *targ_ns,
>> >> >>  {
>> >> >>   struct user_namespace *ns = targ_ns;
>> >> >>
>> >> >> + /* If the capability is controlled and user-ns that process
>> >> >> +  * belongs-to is 'controlled' then return EPERM and no need
>> >> >> +  * to check the user-ns hierarchy.
>> >> >> +  */
>> >> >> + if (is_user_ns_controlled(cred->user_ns) &&
>> >> >> + is_capability_controlled(cap))
>> >> >> + return -EPERM;
>> >> >
>> >> > I'd be curious to see the performance impact on this on a regular
>> >> > workload (kernel build?) in a controlled ns.
>> >> >
>> >> Should it affect? If at all, it should be +ve since, the recursive
>> >> user-ns hierarchy lookup is avoided with the above check if the
>> >> capability is controlled.
>> >
>> > Yes but I expect that to be the rare case for normal lxc installs
>> > (which are of course what I am interested in)
>> >
>> >>  The additional cost otherwise is this check
>> >> per cap_capable() call.
>> >
>> > And pipeline refetching?
>> >
>> > Capability calls also shouldn't be all that frequent, but still I'm
>> > left wondering...
>>
>> Correct, and capability checks are part of the control-path and not
>> the data-path so shouldn't matter but I guess it doesn't hurt to
>> find-out the number. Do you have any workload in mind, that we can use
>> for this test/benchmark?
>
> I suppose if you did both (a) a kernel build and (b) a webserve
> like https://github.com/m3ng9i/ran , being hit for a minute by a
> heavy load of requests, those two together would be re-assuring.
>
Well, I did (a) and (b). Here are the results.

(a0) I used the ubuntu-artful (17.10) VM instance with the standard kernel
to compile the kernel:

mahesh@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s clean
mahesh@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s
real 6m47.525s
user 22m37.424s
sys 2m44.745s

(b0) Now in a user-namespace created by a user that does not have
SYS_ADMIN (just for an apples-to-apples comparison):
mahesh@mahesh-vm0-artful:~$ sysctl -q kernel.controlled_userns_caps_whitelist
sysctl: cannot stat /proc/sys/kernel/controlled_userns_caps_whitelist:
No such file or directory
mahesh@mahesh-vm0-artful:~$ id
uid=1000(mahesh) gid=1000(mahesh)
groups=1000(mahesh),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),118(lpadmin),128(sambashare)
mahesh@mahesh-vm0-artful:~/Work/Linux$ unshare -Uf -- bash
nobody@mahesh-vm0-artful:~/Work/Linux$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
nobody@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s clean
nobody@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s
real 9m10.115s
user 25m20.984s
sys 2m48.129s


(a1) Now I patched the same kernel, built it, and booted into this new kernel:

mahesh@mahesh-vm0-artful:~$ sysctl -q kernel.controlled_userns_caps_whitelist
kernel.controlled_userns_caps_whitelist = 1f,
mahesh@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s clean
mahesh@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s
real 6m39.964s
user 22m23.538s
sys 2m34.258s

(b1) Now in a user-namespace created by a user that does not have SYS_ADMIN

mahesh@mahesh-vm0-artful:~$ id
uid=1000(mahesh) gid=1000(mahesh)
groups=1000(mahesh),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),118(lpadmin),128(sambashare)
mahesh@mahesh-vm0-artful:~/Work/Linux$ unshare -Uf -- bash
nobody@mahesh-vm0-artful:~/Work/Linux$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
nobody@mahesh-vm0-artful:~/Work/Linux$ make -s clean
nobody@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s
real 6m54.725s
user 23m18.833s
sys 2m38.996s

---

For the http-get test, I used the same 'ran' utility you proposed,
wrapped inside a script like -
mahesh@mahesh-vm0-artful:~/Work/Scripts$ cat RanLauncher1m.sh
#!/bin/bash
set -v
(sleep 60; killall ran) &
time (cd ~/go/bin; ./ran -i index.html &g

Re: [PATCHv2 2/2] userns: control capabilities of some user namespaces

2017-11-28 Thread महेश बंडेवार
On Tue, Nov 28, 2017 at 3:04 PM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> ...
>> >> diff --git a/security/commoncap.c b/security/commoncap.c
>> >> index fc46f5b85251..89103f16ac37 100644
>> >> --- a/security/commoncap.c
>> >> +++ b/security/commoncap.c
>> >> @@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct 
>> >> user_namespace *targ_ns,
>> >>  {
>> >>   struct user_namespace *ns = targ_ns;
>> >>
>> >> + /* If the capability is controlled and user-ns that process
>> >> +  * belongs-to is 'controlled' then return EPERM and no need
>> >> +  * to check the user-ns hierarchy.
>> >> +  */
>> >> + if (is_user_ns_controlled(cred->user_ns) &&
>> >> + is_capability_controlled(cap))
>> >> + return -EPERM;
>> >
>> > I'd be curious to see the performance impact on this on a regular
>> > workload (kernel build?) in a controlled ns.
>> >
>> Should it affect? If at all, it should be +ve since, the recursive
>> user-ns hierarchy lookup is avoided with the above check if the
>> capability is controlled.
>
> Yes but I expect that to be the rare case for normal lxc installs
> (which are of course what I am interested in)
>
>>  The additional cost otherwise is this check
>> per cap_capable() call.
>
> And pipeline refetching?
>
> Capability calls also shouldn't be all that frequent, but still I'm
> left wondering...

Correct, and capability checks are part of the control-path and not
the data-path so shouldn't matter but I guess it doesn't hurt to
find-out the number. Do you have any workload in mind, that we can use
for this test/benchmark?


Re: [PATCHv2 2/2] userns: control capabilities of some user namespaces

2017-11-28 Thread महेश बंडेवार
On Sat, Nov 25, 2017 at 10:40 PM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (mah...@bandewar.net):
>> From: Mahesh Bandewar 
>>
>> With this new notion of "controlled" user-namespaces, the controlled
>> user-namespaces are marked at the time of their creation while the
>> capabilities of processes that belong to them are controlled using the
>> global mask.
>>
>> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
>> that belongs to uncontrolled user-ns can create another (child) user-
>> namespace that is uncontrolled. Any other process (that either does
>> not have SYS_ADMIN or belongs to a controlled user-ns) can only
>> create a user-ns that is controlled.
>>
>> global-capability-whitelist (controlled_userns_caps_whitelist) is used
>> at the capability check-time and keeps the semantics for the processes
>> that belong to uncontrolled user-ns as it is. Processes that belong to
>> controlled user-ns however are subjected to different checks-
>>
>>(a) if the capability in question is controlled and process belongs
>>to controlled user-ns, then it's always denied.
>>(b) if the capability in question is NOT controlled then fall back
>>to the traditional check.
>>
>> Signed-off-by: Mahesh Bandewar 
>
> Acked-by: Serge Hallyn 
>
> Although a few comment addition requests below:
>
>> ---
>> v2:
>>   Don't recalculate user-ns flags for every setns() call.
>> v1:
>>   Initial submission.
>>
>>  include/linux/capability.h |  1 +
>>  include/linux/user_namespace.h | 20 
>>  kernel/capability.c|  5 +
>>  kernel/user_namespace.c|  4 
>>  security/commoncap.c   |  8 
>>  5 files changed, 38 insertions(+)
>>
>> diff --git a/include/linux/capability.h b/include/linux/capability.h
>> index 7d79a4689625..a1fd9e460379 100644
>> --- a/include/linux/capability.h
>> +++ b/include/linux/capability.h
>> @@ -251,6 +251,7 @@ extern bool ptracer_capable(struct task_struct *tsk, 
>> struct user_namespace *ns);
>>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
>> cpu_vfs_cap_data *cpu_caps);
>>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>>void __user *buff, size_t *lenp, loff_t 
>> *ppos);
>
> Here and at the definition below, please add a comment explaining
> that a controlled cap is defined as not being in the sysctl.
>
will do in v3.

>> +bool is_capability_controlled(int cap);
>>
>>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
>> size);
>>
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index 3fe714da7f5a..647f825c7b5f 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -23,6 +23,7 @@ struct uid_gid_map {/* 64 bytes -- 1 cache line */
>>  };
>>
>>  #define USERNS_SETGROUPS_ALLOWED 1UL
>> +#define USERNS_CONTROLLED 2UL
>>
>>  #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
>>
>> @@ -103,6 +104,16 @@ static inline void put_user_ns(struct user_namespace 
>> *ns)
>>   __put_user_ns(ns);
>>  }
>>
>
> Please add a comment explaining that a controlled ns
> is one created by a user which did not have CAP_SYS_ADMIN
> (or descended from such an ns).
>
will do in v3.

>> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
>> +{
>> + return ns->flags & USERNS_CONTROLLED;
>> +}
>> +
>> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
>> +{
>> + ns->flags |= USERNS_CONTROLLED;
>> +}
>> +
>>  struct seq_operations;
>>  extern const struct seq_operations proc_uid_seq_operations;
>>  extern const struct seq_operations proc_gid_seq_operations;
>> @@ -161,6 +172,15 @@ static inline struct ns_common *ns_get_owner(struct 
>> ns_common *ns)
>>  {
>>   return ERR_PTR(-EPERM);
>>  }
>> +
>> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
>> +{
>> + return false;
>> +}
>> +
>> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
>> +{
>> +}
>>  #endif
>>
>>  #endif /* _LINUX_USER_H */
>> diff --git a/kernel/capability.c b/kernel/capability.c
>> index 4a859b7d4902..bffe249922de 100644
>> --- a/kernel/capability.c
>> +++ b/kernel/capability.c
>> @@ -511,6 +511,11 @@ bool ptracer_capable(struct task_struct *tsk, struct 
>> user_namespace *ns)
>>  }
>>
>>  /* Controlled-userns capabilities routines */
>> +bool is_capability_controlled(int cap)
>> +{
>> + return !cap_raised(controlled_userns_caps_whitelist, cap);
>> +}
>> +
>>  #ifdef CONFIG_SYSCTL
>>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>>void __user *buff, size_t *lenp, loff_t *ppos)
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index c490f1e4313b..600c7dcb9ff7 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -139,6 +139,10 @@ int create_user_ns(struct

Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-09 Thread महेश बंडेवार
On Fri, Nov 10, 2017 at 1:46 PM, Serge E. Hallyn  wrote:
> Quoting Eric W. Biederman (ebied...@xmission.com):
>> single sandbox.  I am not at all certain that the capabilities is the
>> proper place to limit code reachability.
>
> Right, I keep having this gut feeling that there is another way we
> should be doing that.  Maybe based on ksplice or perf, or maybe more
> based on subsystems.  And I hope someone pursues that.  But I can't put
> my finger on it, and meanwhile the capability checks obviously *are* in
> fact gates...
>
Well, I don't mind if there is a better solution available. The
proposed solution does not add much (or complex) code - it uses a bit and
a sysctl and will sit dormant. When we have a complete solution, this
addition should not be a burden to maintain because of its non-invasive
footprint.

I will push the next version of the patch-set that addresses Serge's findings.

Thanks,
--mahesh..

[PS: I'll soon be traveling again and moving to an area where
connectivity will be scarce / unreliable, so please expect a lot more
delay in my responses.]

> -serge


Re: [PATCH resend 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-11-09 Thread महेश बंडेवार
On Fri, Nov 10, 2017 at 1:30 PM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> ...
>> >>
>> >>  ==
>> >>
>> >> +controlled_userns_caps_whitelist
>> >> +
>> >> +Capability mask that is whitelisted for "controlled" user namespaces.
>> >> +Any capability that is missing from this mask will not be allowed to
>> >> +any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
>> >> +is not part of this mask, then processes running inside any controlled
>> >> +userns's will not be allowed to perform action that needs CAP_NET_RAW
>> >> +capability. However, processes that are attached to a parent user-ns
>> >> +hierarchy that is *not* controlled and has CAP_NET_RAW can continue
>> >> +performing those actions. User-namespaces are marked "controlled" at
>> >> +the time of their creation based on the capabilities of the creator.
>> >> +A process that does not have CAP_SYS_ADMIN will create user-namespaces
>> >> +that are controlled.
>> >
>> > Hm.  I think that's fine (the way 'controlled' user namespaces are
>> > defined), but that is design decision in itself, and should perhaps be
>> > discussed.
>> >
>> > Did you consider other ways?  What about using CAP_SETPCAP?
>> >
>> I did try other ways e.g. using another bounding-set etc. but
>> eventually settled with this approach because of main two properties -
>
> No, I meant did you try other ways of defining a controlled user
> namespace, other than one which is created by a task lacking
> CAP_SYS_ADMIN?
>
SYS_ADMIN is the capability that has been used to decide who can or
cannot create namespaces, so I didn't want to introduce another model
that may not be compatible with the current, well-understood one -
hence no.

> ...
>
>> >> +The value is expressed as two comma separated hex words (u32). This
>> >
>> > Why comma separated?  whitespace ok?  Leading 0x ok?  What is the
>> > default at boot?  (Obviously the patch tells me, I'm asking for it
>> > to be spelled out in the doc)
>> >
I tried multiple ways, including representing capabilities in
string/name form for better readability, but didn't want to add the
extra complexity of dealing with strings and possible
string-related issues here. I also didn't want to invent a new
format, so I settled on something that is widely used (CPU
binding/affinity, IRQ mapping, etc.) and is capable of handling a
growing bit set (currently 37 bits, but possibly more later). A small
usage sketch follows at the end of this message.
>
> Ok, thanks.
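
To make the format concrete, a minimal usage sketch - the word order
(high u32 word first) and the exact write syntax are assumptions for
illustration here, not verified behavior:

  # Drop CAP_NET_RAW (bit 13 of the low word) from the whitelist so that
  # controlled user namespaces lose it; 0xffffffff with bit 13 cleared
  # is 0xffffdfff.
  sysctl -q kernel.controlled_userns_caps_whitelist    # e.g. 1f,ffffffff
  sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
  # Restore the default (all 37 capabilities raised) once the kernel is patched.
  sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffffff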


Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-09 Thread महेश बंडेवार
On Fri, Nov 10, 2017 at 6:58 AM, Eric W. Biederman
 wrote:
> "Mahesh Bandewar (महेश बंडेवार)"  writes:
>
>> [resend response as earlier one failed because of formatting issues]
>>
>> On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn  wrote:
>>>
>>> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) 
>>> wrote:
>>> > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
>>> >  wrote:
>>> > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश 
>>> > > बंडेवार) wrote:
>>> > >> Sorry folks I was traveling and seems like lot happened on this 
>>> > >> thread. :p
>>> > >>
>>> > >> I will try to response few of these comments selectively -
>>> > >>
>>> > >> > The thing that makes me hesitate with this set is that it is a
>>> > >> > permanent new feature to address what (I hope) is a temporary
>>> > >> > problem.
>>> > >> I agree this is permanent new feature but it's not solving a temporary
>>> > >> problem. It's impossible to assess what and when new vulnerability
>>> > >> that could show up. I think Daniel summed it up appropriately in his
>>> > >> response
>>> > >>
>>> > >> > Seems like there are two naive ways to do it, the first being to just
>>> > >> > look at all code under ns_capable() plus code called from there.  It
>>> > >> > seems like looking at the result of that could be fruitful.
>>> > >> This is really hard. The main issue that there were features designed
>>> > >> and developed before user-ns days with an assumption that unprivileged
>>> > >> users will never get certain capabilities which only root user gets.
>>> > >> Now that is not true anymore with user-ns creation with mapping root
>>> > >> for any process. Also at the same time blocking user-ns creation for
>>> > >> eveyone is a big-hammer which is not needed too. So it's not that easy
>>> > >> to just perform a code-walk-though and correct those decisions now.
>>> > >>
>>> > >> > It seems to me that the existing control in
>>> > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct 
>>> > >> > tape
>>> > >> > in that case.
>>> > >> This solution is essentially blocking unprivileged users from using
>>> > >> the user-namespaces entirely. This is not really a solution that can
>>> > >> work. The solution that this patch-set adds allows unprivileged users
>>> > >> to create user-namespaces. Actually the proposed solution is more
>>> > >> fine-grained approach than the unprivileged_userns_clone solution
>>> > >> since you can selectively block capabilities rather than completely
>>> > >> blocking the functionality.
>>> > >
>>> > > I've been talking to Stéphane today about this and we should also keep 
>>> > > in mind
>>> > > that we have:
>>> > >
>>> > > chb@conventiont|~
>>> > >> ls -al /proc/sys/user/
>>> > > total 0
>>> > > dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
>>> > > dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
>>> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
>>> > >
>>> > > These files allow you to limit the number of namespaces that can be 
>>> > > created
>>> > > *per namespace* type. So let's say your system runs a bunch of user 
>>> > > namespaces
>>> > > you can do:
>>> > >
>>> > > chb@conventiont|~
>>> > >> echo 0 > /proc/sys/user/max_user_names

Re: [PATCH resend 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-11-09 Thread महेश बंडेवार
On Fri, Nov 10, 2017 at 2:30 AM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (mah...@bandewar.net):
>> From: Mahesh Bandewar 
>>
>> Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
>
> I understand the arguments in favor of whitelists in most cases for
> security purposes.  But given that you've said the goal here is to
> prevent use of a capability in a user namespace when a CVE has been
> found, a whitelist seems the wrong choice, since
>
> 1. it means that an attacker may through some other means be able
> to add a capability back into the whitelist when you specifically
> wanted to drop it.  With a blacklist, you could say "once a cap has
> been dropped it can never be re-added without rebooting".
> 2. it means by default all capabilities will be denied once the
> switch is pulled which is specifically not what you want in this
> case.
> 3. the admin can't just say "drop CAP_NET_ADMIN", but needs to
> know to echo ~CAP_NET_ADMIN.
>
> Why not make it a blacklist, and once a cap is dropped it can
> never be re-added?
>
Well, I'm not going to deny that the blacklist approach would work equally
well, but the code becomes a little simpler with the whitelist
approach - especially when a new capability needs to be added (not that
we add capabilities very often), though that would be something one
would have to pay attention to. Also, with this approach I can just use
CAP_FULL_SET, which is readily available.

Having said that, I don't have a strong preference in this
regard (whitelist vs. blacklist).
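
For illustration, here is roughly the read-modify-write an admin would do
to "drop CAP_NET_ADMIN" under the whitelist scheme - the sysctl name,
word order and accepted syntax are assumptions from this thread, not
merged kernel behavior:

  # CAP_NET_ADMIN is bit 12 of the low 32-bit word.
  cur=$(sysctl -n kernel.controlled_userns_caps_whitelist)   # e.g. "1f,ffffffff"
  high=${cur%,*}; low=${cur#*,}
  low=$(printf '%x' $(( 0x$low & ~(1 << 12) )))
  sysctl -w kernel.controlled_userns_caps_whitelist="$high,$low"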

> -serge


Re: [PATCH resend 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-11-09 Thread महेश बंडेवार
On Fri, Nov 10, 2017 at 2:22 AM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (mah...@bandewar.net):
>> From: Mahesh Bandewar 
>>
>> Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
>> takes input as capability mask expressed as two comma separated hex
>> u32 words. The mask, however, is stored in kernel as kernel_cap_t type.
>>
>> Any capabilities that are not part of this mask will be controlled and
>> will not be allowed to processes in controlled user-ns.
>>
>> Signed-off-by: Mahesh Bandewar 
>> ---
>>  Documentation/sysctl/kernel.txt | 21 ++
>>  include/linux/capability.h  |  3 +++
>>  kernel/capability.c | 47 
>> +
>>  kernel/sysctl.c |  5 +
>>  4 files changed, 76 insertions(+)
>>
>> diff --git a/Documentation/sysctl/kernel.txt 
>> b/Documentation/sysctl/kernel.txt
>> index 694968c7523c..a1d39dbae847 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -25,6 +25,7 @@ show up in /proc/sys/kernel:
>>  - bootloader_version  [ X86 only ]
>>  - callhome[ S390 only ]
>>  - cap_last_cap
>> +- controlled_userns_caps_whitelist
>>  - core_pattern
>>  - core_pipe_limit
>>  - core_uses_pid
>> @@ -187,6 +188,26 @@ CAP_LAST_CAP from the kernel.
>>
>>  ==
>>
>> +controlled_userns_caps_whitelist
>> +
>> +Capability mask that is whitelisted for "controlled" user namespaces.
>> +Any capability that is missing from this mask will not be allowed to
>> +any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
>> +is not part of this mask, then processes running inside any controlled
>> +userns's will not be allowed to perform action that needs CAP_NET_RAW
>> +capability. However, processes that are attached to a parent user-ns
>> +hierarchy that is *not* controlled and has CAP_NET_RAW can continue
>> +performing those actions. User-namespaces are marked "controlled" at
>> +the time of their creation based on the capabilities of the creator.
>> +A process that does not have CAP_SYS_ADMIN will create user-namespaces
>> +that are controlled.
>
> Hm.  I think that's fine (the way 'controlled' user namespaces are
> defined), but that is design decision in itself, and should perhaps be
> discussed.
>
> Did you consider other ways?  What about using CAP_SETPCAP?
>
I did try other ways - e.g. using another bounding-set, etc. - but
eventually settled on this approach because of two main properties:
(a) it has a creation-time setting that can be turned on/off at runtime,
and (b) the run-time knob actually controls the behavior, which can
range from no-op to very drastic, without needing the applications to
change, and it stays under the admin's control. Also, there is always
more than one way of solving the problem, and there could well be a
better alternative - I don't deny that. :/

Controlling individual capabilities is going to give a very different
experience for each, so how a process behaves when a specific capability
is controlled is probably out of scope for this patch-set. I would like
to offload that responsibility to the admin, as he/she would be the best
judge and most knowledgeable about the situation / environment. This
should be used as a tool to gain control.

>> +The value is expressed as two comma separated hex words (u32). This
>
> Why comma separated?  whitespace ok?  Leading 0x ok?  What is the
> default at boot?  (Obviously the patch tells me, I'm asking for it
> to be spelled out in the doc)
>
I tried multiple ways, including representing capabilities in
string/name form for better readability, but didn't want to add the
extra complexity of dealing with strings and possible
string-related issues here. I also didn't want to invent a new
format, so I settled on something that is widely used (CPU
binding/affinity, IRQ mapping, etc.) and is capable of handling a
growing bit set (currently 37 bits, but possibly more later).

> Otherwise looks good, thanks!
>
> Serge
>
>> +sysctl is available in init-ns and users with CAP_SYS_ADMIN in init-ns
>> +are allowed to make changes.
>> +
>> +==
>> +
>>  core_pattern:
>>
>>  core_pattern is used to specify a core dumpfile pattern name.
>> diff --git a/include/linux/capability.h b/include/linux/capability.h
>> index b52e278e4744..6c0b9677c03f 100644
>> --- a/include/linux/capability.h
>> +++ b/include/linux/capability.h
>> @@ -13,6 +13,7 @@
>>  #define _LINUX_CAPABILITY_H
>>
>>  #include 
>> +#include 
>>
>>
>>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>> @@ -247,6 +248,8 @@ extern bool ptracer_capable(struct task_struct *tsk, 
>> struct user_namespace *ns);
>>
>>  /* audit system wants to get cap info from files as well */
>>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
>> cpu_vfs_cap_data *cpu_caps);
>> +int proc_douserns_caps_whitelist(struct ctl_t

Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-09 Thread महेश बंडेवार
On Fri, Nov 10, 2017 at 2:25 AM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (mah...@bandewar.net):
>> From: Mahesh Bandewar 
>>
>> With this new notion of "controlled" user-namespaces, the controlled
>> user-namespaces are marked at the time of their creation while the
>> capabilities of processes that belong to them are controlled using the
>> global mask.
>>
>> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
>> that belongs to uncontrolled user-ns can create another (child) user-
>> namespace that is uncontrolled. Any other process (that either does
>> not have SYS_ADMIN or belongs to a controlled user-ns) can only
>> create a user-ns that is controlled.
>>
>> global-capability-whitelist (controlled_userns_caps_whitelist) is used
>> at the capability check-time and keeps the semantics for the processes
>> that belong to uncontrolled user-ns as it is. Processes that belong to
>> controlled user-ns however are subjected to different checks-
>>
>>(a) if the capability in question is controlled and process belongs
>>to controlled user-ns, then it's always denied.
>>(b) if the capability in question is NOT controlled then fall back
>>to the traditional check.
>>
>> Signed-off-by: Mahesh Bandewar 
>> ---
>>  include/linux/capability.h |  1 +
>>  include/linux/user_namespace.h | 20 
>>  kernel/capability.c|  5 +
>>  kernel/user_namespace.c|  3 +++
>>  security/commoncap.c   |  8 
>>  5 files changed, 37 insertions(+)
>>
>> diff --git a/include/linux/capability.h b/include/linux/capability.h
>> index 6c0b9677c03f..b8c6cac18658 100644
>> --- a/include/linux/capability.h
>> +++ b/include/linux/capability.h
>> @@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, 
>> struct user_namespace *ns);
>>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
>> cpu_vfs_cap_data *cpu_caps);
>>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>>void __user *buff, size_t *lenp, loff_t 
>> *ppos);
>> +bool is_capability_controlled(int cap);
>>
>>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
>> size);
>>
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index c18e01252346..e890fe81b47e 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -22,6 +22,7 @@ struct uid_gid_map {/* 64 bytes -- 1 cache line */
>>  };
>>
>>  #define USERNS_SETGROUPS_ALLOWED 1UL
>> +#define USERNS_CONTROLLED 2UL
>>
>>  #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
>>
>> @@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace 
>> *ns)
>>   __put_user_ns(ns);
>>  }
>>
>> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
>> +{
>> + return ns->flags & USERNS_CONTROLLED;
>> +}
>> +
>> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
>> +{
>> + ns->flags |= USERNS_CONTROLLED;
>> +}
>> +
>>  struct seq_operations;
>>  extern const struct seq_operations proc_uid_seq_operations;
>>  extern const struct seq_operations proc_gid_seq_operations;
>> @@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct 
>> ns_common *ns)
>>  {
>>   return ERR_PTR(-EPERM);
>>  }
>> +
>> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
>> +{
>> + return false;
>> +}
>> +
>> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
>> +{
>> +}
>>  #endif
>>
>>  #endif /* _LINUX_USER_H */
>> diff --git a/kernel/capability.c b/kernel/capability.c
>> index 62dbe3350c1b..40a38cc4ff43 100644
>> --- a/kernel/capability.c
>> +++ b/kernel/capability.c
>> @@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct 
>> user_namespace *ns)
>>  }
>>
>>  /* Controlled-userns capabilities routines */
>> +bool is_capability_controlled(int cap)
>> +{
>> + return !cap_raised(controlled_userns_caps_whitelist, cap);
>> +}
>> +
>>  #ifdef CONFIG_SYSCTL
>>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>>void __user *buff, size_t *lenp, loff_t *ppos)
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index c490f1e4313b..f393ea5108f0 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct 
>> user_namespace *user_ns)
>>   cred->cap_effective = CAP_FULL_SET;
>>   cred->cap_ambient = CAP_EMPTY_SET;
>>   cred->cap_bset = CAP_FULL_SET;
>> + if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) ||
>> + is_user_ns_controlled(user_ns->parent))
>> + mark_user_ns_controlled(user_ns);
>
> Hm, why do this here, rather than at create_user_ns()? It
> shouldn't be recalculated when someone does setns() should it?
>
You are absolutely right! It 

Re: [PATCH] ipvlan: fix ipv6 outbound device

2017-11-09 Thread महेश बंडेवार
On Thu, Nov 9, 2017 at 9:09 PM,   wrote:
> From: Keefe Liu 
>
> When process the outbound packet of ipv6, we should assign the master
> device to output device other than input device.
>
Curious to know - how did you discover this?

> Signed-off-by: Keefe Liu 
Acked-by: Mahesh Bandewar 
> ---
>  drivers/net/ipvlan/ipvlan_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ipvlan/ipvlan_core.c 
> b/drivers/net/ipvlan/ipvlan_core.c
> index 034ae4c..f2a7e92 100644
> --- a/drivers/net/ipvlan/ipvlan_core.c
> +++ b/drivers/net/ipvlan/ipvlan_core.c
> @@ -409,7 +409,7 @@ static int ipvlan_process_v6_outbound(struct sk_buff *skb)
> struct dst_entry *dst;
> int err, ret = NET_XMIT_DROP;
> struct flowi6 fl6 = {
> -   .flowi6_iif = dev->ifindex,
> +   .flowi6_oif = dev->ifindex,
> .daddr = ip6h->daddr,
> .saddr = ip6h->saddr,
> .flowi6_flags = FLOWI_FLAG_ANYSRC,
> --
> 1.8.3.1
>
>


Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-08 Thread महेश बंडेवार
[resend response as earlier one failed because of formatting issues]

On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn  wrote:
>
> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) 
> wrote:
> > On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
> >  wrote:
> > > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) 
> > > wrote:
> > >> Sorry folks I was traveling and seems like lot happened on this thread. 
> > >> :p
> > >>
> > >> I will try to response few of these comments selectively -
> > >>
> > >> > The thing that makes me hesitate with this set is that it is a
> > >> > permanent new feature to address what (I hope) is a temporary
> > >> > problem.
> > >> I agree this is permanent new feature but it's not solving a temporary
> > >> problem. It's impossible to assess what and when new vulnerability
> > >> that could show up. I think Daniel summed it up appropriately in his
> > >> response
> > >>
> > >> > Seems like there are two naive ways to do it, the first being to just
> > >> > look at all code under ns_capable() plus code called from there.  It
> > >> > seems like looking at the result of that could be fruitful.
> > >> This is really hard. The main issue that there were features designed
> > >> and developed before user-ns days with an assumption that unprivileged
> > >> users will never get certain capabilities which only root user gets.
> > >> Now that is not true anymore with user-ns creation with mapping root
> > >> for any process. Also at the same time blocking user-ns creation for
> > >> eveyone is a big-hammer which is not needed too. So it's not that easy
> > >> to just perform a code-walk-though and correct those decisions now.
> > >>
> > >> > It seems to me that the existing control in
> > >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct 
> > >> > tape
> > >> > in that case.
> > >> This solution is essentially blocking unprivileged users from using
> > >> the user-namespaces entirely. This is not really a solution that can
> > >> work. The solution that this patch-set adds allows unprivileged users
> > >> to create user-namespaces. Actually the proposed solution is more
> > >> fine-grained approach than the unprivileged_userns_clone solution
> > >> since you can selectively block capabilities rather than completely
> > >> blocking the functionality.
> > >
> > > I've been talking to Stéphane today about this and we should also keep in 
> > > mind
> > > that we have:
> > >
> > > chb@conventiont|~
> > >> ls -al /proc/sys/user/
> > > total 0
> > > dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
> > > dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
> > > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
> > >
> > > These files allow you to limit the number of namespaces that can be 
> > > created
> > > *per namespace* type. So let's say your system runs a bunch of user 
> > > namespaces
> > > you can do:
> > >
> > > chb@conventiont|~
> > >> echo 0 > /proc/sys/user/max_user_namespaces
> > >
> > > So that the next time you try to create a user namespaces you'd see:
> > >
> > > chb@conventiont|~
> > >> unshare -U
> > > unshare: unshare failed: No space left on device
> > >
> > > So there's not even a need to upstream a new sysctl since we have ways of
> > > blocking this.
> > >
> > I'm not sure how it's solving the problem that my patch-set is addressing?
> > I agree though that the need for unprivileged_userns_clone sysctl goes
> > away as this is equivalent to setting that sysctl to 0 as you have
> > described above.
>
> oh ri

Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-08 Thread महेश बंडेवार
On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
 wrote:
> On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) 
> wrote:
>> Sorry folks I was traveling and seems like lot happened on this thread. :p
>>
>> I will try to response few of these comments selectively -
>>
>> > The thing that makes me hesitate with this set is that it is a
>> > permanent new feature to address what (I hope) is a temporary
>> > problem.
>> I agree this is permanent new feature but it's not solving a temporary
>> problem. It's impossible to assess what and when new vulnerability
>> that could show up. I think Daniel summed it up appropriately in his
>> response
>>
>> > Seems like there are two naive ways to do it, the first being to just
>> > look at all code under ns_capable() plus code called from there.  It
>> > seems like looking at the result of that could be fruitful.
>> This is really hard. The main issue that there were features designed
>> and developed before user-ns days with an assumption that unprivileged
>> users will never get certain capabilities which only root user gets.
>> Now that is not true anymore with user-ns creation with mapping root
>> for any process. Also at the same time blocking user-ns creation for
>> eveyone is a big-hammer which is not needed too. So it's not that easy
>> to just perform a code-walk-though and correct those decisions now.
>>
>> > It seems to me that the existing control in
>> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
>> > in that case.
>> This solution is essentially blocking unprivileged users from using
>> the user-namespaces entirely. This is not really a solution that can
>> work. The solution that this patch-set adds allows unprivileged users
>> to create user-namespaces. Actually the proposed solution is more
>> fine-grained approach than the unprivileged_userns_clone solution
>> since you can selectively block capabilities rather than completely
>> blocking the functionality.
>
> I've been talking to Stéphane today about this and we should also keep in mind
> that we have:
>
> chb@conventiont|~
>> ls -al /proc/sys/user/
> total 0
> dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
> dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
> -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
>
> These files allow you to limit the number of namespaces that can be created
> *per namespace* type. So let's say your system runs a bunch of user namespaces
> you can do:
>
> chb@conventiont|~
>> echo 0 > /proc/sys/user/max_user_namespaces
>
> So that the next time you try to create a user namespaces you'd see:
>
> chb@conventiont|~
>> unshare -U
> unshare: unshare failed: No space left on device
>
> So there's not even a need to upstream a new sysctl since we have ways of
> blocking this.
>
I'm not sure how that solves the problem that my patch-set is addressing.
I agree, though, that the need for the unprivileged_userns_clone sysctl
goes away, as this is equivalent to setting that sysctl to 0 as you have
described above.
However, as I mentioned earlier, blocking processes from creating
user-namespaces is not the solution. Processes should be able to
create namespaces as they are designed to, but at the same time we need
to have controls to 'contain' them if a need arises. Setting the max
number to 0 is not the solution that I'm looking for, since it doesn't
solve the problem.

> Also I'd like to point out that a lot of capability checks and actual security
> vulnerabilities are associated with CAP_SYS_ADMIN. So what you likely want to 
> do
> is block CAP_SYS_ADMIN in user namespaces but at this point they become
> basically useless for a lot of interesting use cases. In addition, this patch
> would add another layer of complexity that is - imho - not really warranted
> given what we already have.
I disagree. I'm not sure how this patch adds complexity. The
functionality is maintained exactly as it is, with an extra knob
which allows you to take control back if a situation arises. Once the
kernel is patched for whatever was discovered, life ret

Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-08 Thread महेश बंडेवार
Sorry folks, I was traveling and it seems like a lot happened on this thread. :p

I will try to respond to a few of these comments selectively -

> The thing that makes me hesitate with this set is that it is a
> permanent new feature to address what (I hope) is a temporary
> problem.
I agree this is a permanent new feature, but it's not solving a temporary
problem. It's impossible to assess what new vulnerability could show up,
or when. I think Daniel summed it up appropriately in his
response

> Seems like there are two naive ways to do it, the first being to just
> look at all code under ns_capable() plus code called from there.  It
> seems like looking at the result of that could be fruitful.
This is really hard. The main issue is that there were features designed
and developed before the user-ns days with the assumption that unprivileged
users would never get certain capabilities which only the root user gets.
That is not true anymore now that any process can create a user-ns with
root mapped. At the same time, blocking user-ns creation for
everyone is a big hammer which is not needed either. So it's not that easy
to just perform a code walk-through and correct those decisions now.

> It seems to me that the existing control in
> /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> in that case.
This solution essentially blocks unprivileged users from using
user-namespaces entirely. That is not really a solution that can
work. The solution that this patch-set adds allows unprivileged users
to create user-namespaces. Actually, the proposed solution is a more
fine-grained approach than the unprivileged_userns_clone solution,
since you can selectively block capabilities rather than completely
blocking the functionality.

> I meant each task has a perm_cap_bset next to the cap_bset.  So task
> p1 (if it has privilege) can drop CAP_SYS_ADMIN from perm_cap_bset,
> p2 (if it has privilege) can drop CAP_NET_ADMIN.  When p1 creates a
> new user_ns, that init task has its cap_bset set to all caps but
> CAP_SYS_ADMIN.
>
> I think for simplicity perm_cap_bset would *only* affect the filling
> of cap_bset at user namespace creation.  So if you wanted to drop a
> capability from your own cap_bset as well, you'd have to do that
> separately.
My original intention is to reduce the attack surface when
vulnerabilities are discovered / published, but I don't see how this
solves that issue. Also, the reason to have a sysctl is to have a
simple, across-the-board control to contain the situation. If that
is not addressed then we might need some other solution on top of
this.


Re: [PATCH net] bonding: fix slave stuck in BOND_LINK_FAIL state

2017-11-07 Thread महेश बंडेवार
On Tue, Nov 7, 2017 at 2:50 AM, Jay Vosburgh  wrote:
> The bonding miimon logic has a flaw, in that a failure of the
> rtnl_trylock can cause a slave to become permanently stuck in
> BOND_LINK_FAIL state.
>
> The sequence of events to cause this is as follows:
>
> 1) bond_miimon_inspect finds that a slave's link is down, and so
> calls bond_propose_link_state, setting slave->new_link_state to
> BOND_LINK_FAIL, then sets slave->new_link to BOND_LINK_DOWN and returns
> non-zero.
>
> 2) In bond_mii_monitor, the rtnl_trylock fails, and the timer is
> rescheduled.  No change is committed.
>
> 3) bond_miimon_inspect is called again, but this time the slave
> from step 1 has recovered.  slave->new_link is reset to NOCHANGE, and, as
> slave->link was never changed, the switch enters the BOND_LINK_UP case,
> and does nothing.  The pending BOND_LINK_FAIL state from step 1 remains
> pending, as new_link_state is not reset.
>
> 4) The state from step 3 persists until another slave changes link
> state and causes bond_miimon_inspect to return non-zero.  At this point,
> the BOND_LINK_FAIL state change on the slave from steps 1-3 is committed,
> and the slave will remain stuck in BOND_LINK_FAIL state even though it
> is actually link up.
>
> The remedy for this is to initialize new_link_state on each entry
> to bond_miimon_inspect, as is already done with new_link.
>
> Reported-by: Alex Sidorenko 
> Reviewed-by: Jarod Wilson 
> Signed-off-by: Jay Vosburgh 
> Fixes: fb9eb899a6dc ("bonding: handle link transition from FAIL to UP 
> correctly")
Acked-by: Mahesh Bandewar 
> ---
>  drivers/net/bonding/bond_main.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index c99dc59d729b..167434e952da 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -2042,6 +2042,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>
> bond_for_each_slave_rcu(bond, slave, iter) {
> slave->new_link = BOND_LINK_NOCHANGE;
> +   slave->link_new_state = slave->link;
>
> link_state = bond_check_dev_link(bond, slave->dev, 0);
>
> --
> 2.14.1
>


Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-05 Thread महेश बंडेवार
On Sat, Nov 4, 2017 at 4:53 PM, Serge E. Hallyn  wrote:
>
> Quoting Mahesh Bandewar (mah...@bandewar.net):
> > Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> > that belongs to uncontrolled user-ns can create another (child) user-
> > namespace that is uncontrolled. Any other process (that either does
> > not have SYS_ADMIN or belongs to a controlled user-ns) can only
> > create a user-ns that is controlled.
>
> That's a huge change though.  It means that any system that previously
> used unprivileged containers will need new privileged code (which always
> risks more privilege leaks through the new code) to re-enable what was
> possible without privilege before.  That's a regression.
>
I wouldn't call it a regression, since the existing behavior is
preserved as-is if the default mask is not altered - i.e. an
uncontrolled process can create a user-ns and have full control inside
that user-ns. The only difference is that if, for example, 'something'
comes up which makes a specific capability expose ring-0, the admin can
quickly remove the capability in question from the mask, so that no
untrusted code can exploit that capability until either the kernel is
patched or the workloads are sanitized with the discovery in mind.
(I have given some real-life examples of recently published
vulnerabilities around CAP_NET_RAW in the cover letter.)

> I'm very much interested in what you want to do,  But it seems like
> it would be worth starting with some automated code analysis that shows
> exactly what code becomes accessible to unprivileged users with user
> namespaces which was accessible to unprivileged users before.  Then we
> can reason about classifying that code and perhaps limiting access to
> some of it.
I would like to look at this as 'a tool' available to admins,
who can quickly bring a possible compromise under control,
probably at the cost of some functionality loss, until the kernel is
patched and the mask is restored to its default value.

I'm not sure automated analysis would discover anything, since these
changes should not alter behavior in any way.
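
As a side note, the two hex words of such a mask can be decoded back into
capability names with the standard capsh(1) tool by joining them into one
hex value (the high-word-first ordering is an assumption used here only
for illustration):

  # With CAP_NET_RAW (bit 13) dropped, the mask would read 1f,ffffdfff:
  capsh --decode=1fffffdfff
  # expected: every capability name listed except cap_net_raw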


Re: [kernel-hardening] [PATCH 0/2] capability controlled user-namespaces

2017-10-19 Thread महेश बंडेवार
On Mon, Oct 2, 2017 at 11:12 AM, Mahesh Bandewar (महेश बंडेवार)
 wrote:
> On Mon, Oct 2, 2017 at 10:14 AM, Serge E. Hallyn  wrote:
>> Quoting Mahesh Bandewar (mah...@bandewar.net):
>>> From: Mahesh Bandewar 
>>>
>>> [Same as the previous RFC series sent on 9/21]
>>>
>>> TL;DR version
>>> -
>>> Creating a sandbox environment with namespaces is challenging
>>> considering what these sandboxed processes can engage into. e.g.
>>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>>> Current form of user-namespaces, however, if changed a bit can allow
>>> us to create a sandbox environment without locking down user-
>>> namespaces.
>>>
>>> Detailed version
>>> 
>>
>> Hi,
>>
>> still struggling with how I feel about the idea in general.
>>
>> So is the intent mainly that if/when there comes an 0-day which allows
>> users with CAP_NET_ADMIN in any namespace to gain privilege on the host,
>> then this can be used as a stop-gap measure until there is a proper fix?
>>
> Thank for looking at this Serge.
>
> Yes, but at the same time it's not just limited to NET_ADMIN but could
> be any of the current capabilities.
>
>> Otherwise, do you have any guidance for how people should use this?
>>
>> IMO it should be heavily discouraged to use this tool as a regular
>> day to day configuration, as I'm not sure there is any "educated"
>> decision to be made, even by those who are in the know, about what
>> to put in this set.
>>
> I think that really depends on the environment. e.g. in certain
> sandboxes third-part / semi-trusted workload is executed where network
> resource is not used. In that environment I can easily take off
> NET_ADMIN and NET_RAW without affecting anything there. At the same
> time I wont have to worry about 0-day related to these two
> capabilities. I would say the Admins at these places are in the best
> place to decide what they can take-off safely and what they cannot.
> Even if they decide not to take-off anything, having a tool at hand to
> gain control is important when the next 0-day strikes us that can be
> exploited using any of the currently used capabilities.
>
> However, you are absolutely right in terms of using it as a stop-gap
> measure to protect environment until it's fixed and the capability in
> question can not be safely taken off permanently without hampering
> operations.
>
> thanks,
> --mahesh..
>
> [...]

friendly ping.


Re: [RFC] Support for UNARP (RFC 1868)

2017-10-12 Thread महेश बंडेवार
On Thu, Oct 12, 2017 at 4:06 PM, Girish Moodalbail
 wrote:
> Hello Eric,
>
> The basic idea is to mark the ARP entry either FAILED or STALE as soon as we
> can so that the subsequent packets that depend on that ARP entry will take
> the slow path (neigh_resolve_output()).
>
> Say, if base_reachable_time is 30 seconds, then an ARP entry will be in
> reachable state somewhere between 15 to 45 seconds. Assuming the worst case,
> the ARP entry will be in REACHABLE state for 45 seconds and the packets
> continue to traverse the network towards the source machine and gets dropped
> their since the VM has moved to destination machine.
>
> Instead, based on the received UNARP packet if we mark the ARP entry
>
> (a) FAILED
>- we move to INCOMPLETE state and start the address resolution by sending
>  out ARP packets (up to allowed maximum number) until we get ARP
> response
>  back at which point we move the ARP entry state to reachable.
>
> (b) STALE
>- we move to DELAY state and set the next timer to DELAY_PROBE_TIME
>  (1 second) and continue to send queued packets in arp_queue.
>- After 1 sec we move to PROBE state and start the address resolution
> like
>  in the case(a) above.
>
> I was leaning towards (a).
One could arbitrarily increase the stale timeout (by changing the number
of probes), so the sender would continue sending traffic to something
that has already gone away. STALE doesn't mean bad, but here the sender
is clearly indicating it's going away, so FAILED seems to be the only
logical option.
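
For reference, the timers and NUD states discussed above can be observed
with standard tools; a rough sketch (eth0 is just a placeholder name):

  # REACHABLE lifetime is randomized around base_reachable_time(_ms).
  sysctl net.ipv4.neigh.eth0.base_reachable_time_ms
  sysctl net.ipv4.neigh.eth0.delay_first_probe_time
  sysctl net.ipv4.neigh.eth0.ucast_solicit net.ipv4.neigh.eth0.mcast_solicit
  # Per-entry NUD state (REACHABLE/STALE/DELAY/PROBE/FAILED/INCOMPLETE):
  ip -s neigh show dev eth0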

> Please see in-line..
>
> 
>
>>
>> Hi Girish
>>
>> Your description (or patch title) is misleading. You apparently
>> implement the receive side of the RFC.
>
>
> You are right, it implements only the receive side of the RFC. If this RFC
> is accepted, then we can change arping(8) to generate UNARP requests. We
> could also add an option to ip-address(8) delete subcommand to generate
> UNARP whenever an address is deleted from the interface.
>
>> And the RFC had Proxy ARP in mind.
>>
>> What about security implications ?
>
>
> Yes, this feature will extend the attack surface for L2 networks. However,
> the attack vectors for this feature should be same as that of the gratuitous
> ARP, right? The same attack mitigation techniques for gratuitous ARPs is
> equally applicable here.
>
>> Will TCP flows be terminated, instead
>> of being smoothly migrated (TCP_REPAIR)
>
>
> The TCP flows will not be terminated. Upon receiving UNARP packet, the ARP
> entry will be marked FAILED. The subsequent TCP packets from the client
> (towards that IP) will be queued (the first 3 packets in arp_queue and then
> other packets get dropped) until the IP address is resolved again through
> the slow path neigh_resolve_output().
>
> The slow path marks the entry as INCOMPLETE and will start sending several
> ARP requests (ucast_solicit + app_solicit + mcast_solicit) to resolve the
> IP. If the resolution is successful, then the TCP packets will be sent out.
> If not, we will invalidate the ARP entry and call arp_error_report() on the
> queued packets (which will end up sending ICMP_HOST_UNREACH error). This
> behavior is same as what will occur if TCP server disappears in the middle
> of a connection.
>
>>
>> What about IPv6 ? Or maybe more abruptly, do we still need to add
>> features to IPv4 in 2017,  22 years after this RFC came ? ;)
>
>
> Legit question :). Well one thing I have seen in Networking is that an old
> idea circles back around later and turns out to be useful in new contexts
> and use cases. Like I enumerated in my initial email there are certain use
> cases in Cloud that might benefit from UNARP.
>
It doesn't make sense to have this implemented only for IPv4. If the
equivalent IPv6 feature is missing at this time, I don't see it being
useful / acceptable.

> regards,
> ~Girish
>
>>
>> Thanks.
>>
>>
>


Re: [kernel-hardening] [PATCH 0/2] capability controlled user-namespaces

2017-10-02 Thread महेश बंडेवार
On Mon, Oct 2, 2017 at 10:14 AM, Serge E. Hallyn  wrote:
> Quoting Mahesh Bandewar (mah...@bandewar.net):
>> From: Mahesh Bandewar 
>>
>> [Same as the previous RFC series sent on 9/21]
>>
>> TL;DR version
>> -
>> Creating a sandbox environment with namespaces is challenging
>> considering what these sandboxed processes can engage into. e.g.
>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>> Current form of user-namespaces, however, if changed a bit can allow
>> us to create a sandbox environment without locking down user-
>> namespaces.
>>
>> Detailed version
>> 
>
> Hi,
>
> still struggling with how I feel about the idea in general.
>
> So is the intent mainly that if/when there comes an 0-day which allows
> users with CAP_NET_ADMIN in any namespace to gain privilege on the host,
> then this can be used as a stop-gap measure until there is a proper fix?
>
Thanks for looking at this, Serge.

Yes, but at the same time it's not just limited to NET_ADMIN but could
be any of the current capabilities.

> Otherwise, do you have any guidance for how people should use this?
>
> IMO it should be heavily discouraged to use this tool as a regular
> day to day configuration, as I'm not sure there is any "educated"
> decision to be made, even by those who are in the know, about what
> to put in this set.
>
I think that really depends on the environment. E.g. in certain
sandboxes a third-party / semi-trusted workload is executed where the
network resource is not used. In that environment I can easily take off
NET_ADMIN and NET_RAW without affecting anything there. At the same
time I won't have to worry about a 0-day related to these two
capabilities. I would say the admins at these places are in the best
position to decide what they can take off safely and what they cannot.
Even if they decide not to take off anything, having a tool at hand to
regain control is important when the next 0-day strikes that can be
exploited using any of the currently used capabilities.

However, you are absolutely right in terms of using it as a stop-gap
measure to protect the environment until it's fixed, when the capability
in question cannot be safely taken off permanently without hampering
operations.

thanks,
--mahesh..

[...]


Re: [PATCH next] bonding: speed/duplex update at NETDEV_UP event

2017-09-29 Thread महेश बंडेवार
On Fri, Sep 29, 2017 at 1:08 PM, Stephen Hemminger
 wrote:
> On Wed, 27 Sep 2017 18:03:49 -0700
> Mahesh Bandewar  wrote:
>
>> From: Mahesh Bandewar 
>>
>> Some NIC drivers don't have correct speed/duplex settings at the
>> time they send NETDEV_UP notification and that messes up the
>> bonding state. Especially 802.3ad mode which is very sensitive
>> to these settings. In the current implementation we invoke
>> bond_update_speed_duplex() when we receive NETDEV_UP, however,
>> ignore the return value. If the values we get are invalid
>> (UNKNOWN), then slave gets removed from the aggregator with
>> speed and duplex set to UNKNOWN while link is still marked as UP.
>>
>> This patch fixes this scenario. Also 802.3ad mode is sensitive to
>> these conditions while other modes are not, so making sure that it
>> doesn't change the behavior for other modes.
>>
>> Signed-off-by: Mahesh Bandewar 
>> ---
>>  drivers/net/bonding/bond_main.c | 11 ++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/bonding/bond_main.c 
>> b/drivers/net/bonding/bond_main.c
>> index b7313c1d9dcd..177be373966b 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -3076,7 +3076,16 @@ static int bond_slave_netdev_event(unsigned long 
>> event,
>>   break;
>>   case NETDEV_UP:
>>   case NETDEV_CHANGE:
>> - bond_update_speed_duplex(slave);
>> + /* For 802.3ad mode only:
>> +  * Getting invalid Speed/Duplex values here will put slave
>> +  * in weird state. So mark it as link-down for the time
>> +  * being and let link-monitoring (miimon) set it right when
>> +  * correct speeds/duplex are available.
>> +  */
>> + if (bond_update_speed_duplex(slave) &&
>> + BOND_MODE(bond) == BOND_MODE_8023AD)
>> + slave->link = BOND_LINK_DOWN;
>> +
>>   if (BOND_MODE(bond) == BOND_MODE_8023AD)
>>   bond_3ad_adapter_speed_duplex_changed(slave);
>>   /* Fallthrough */
>
> Then fix the drivers. Trying to workaround it here isn't helping.
>
This is not a workaround. It avoids the bonding state becoming inconsistent.

> The problem is that miimon is not required.  Bonding can be purely
> event driven.
>
Really? Here is a code snippet from bonding itself -

/* reset values for 802.3ad/TLB/ALB */
if (!bond_mode_uses_arp(bond_mode)) {
        if (!miimon) {
                pr_warn("Warning: miimon must be specified, otherwise bonding will not detect link failure, speed and duplex which are essential for 802.3ad operation\n");
                pr_warn("Forcing miimon to 100msec\n");
                miimon = BOND_DEFAULT_MIIMON;
        }
}
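
For reference, the per-slave speed/duplex that 802.3ad relies on can be
checked through standard interfaces; a rough sketch (eth0/bond0 are
placeholder names):

  ethtool eth0 | grep -E 'Speed|Duplex'
  grep -E 'Slave Interface|MII Status|Speed|Duplex' /proc/net/bonding/bond0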


Re: [PATCH,v3,net-next 2/2] tun: enable napi_gro_frags() for TUN/TAP driver

2017-09-25 Thread महेश बंडेवार
On Fri, Sep 22, 2017 at 1:49 PM, Petar Penkov  wrote:
> Add a TUN/TAP receive mode that exercises the napi_gro_frags()
> interface. This mode is available only in TAP mode, as the interface
> expects packets with Ethernet headers.
>
> Furthermore, packets follow the layout of the iovec_iter that was
> received. The first iovec is the linear data, and every one after the
> first is a fragment. If there are more fragments than the max number,
> drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
> dissector code and to verify that the header resides in the linear data.
>
> The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
> This is imposed because this mode is intended for testing via tools like
> syzkaller and packetdrill, and the increased flexibility it provides can
> introduce security vulnerabilities. This flag is accepted only if the
> device is in TAP mode and has the IFF_NAPI flag set as well. This is
> done because both of these are explicit requirements for correct
> operation in this mode.
>
> Signed-off-by: Petar Penkov 

Thank you Petar.

Acked-by: Mahesh Bandewar 
> Cc: Eric Dumazet 
> Cc: Mahesh Bandewar 
> Cc: Willem de Bruijn 
> Cc: da...@davemloft.net
> Cc: ppen...@stanford.edu
> ---
>  drivers/net/tun.c   | 134 
> ++--
>  include/uapi/linux/if_tun.h |   1 +
>  2 files changed, 129 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index f16407242b18..9880b3bc8fa5 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -75,6 +75,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>
> @@ -121,7 +122,8 @@ do {  \
>  #define TUN_VNET_BE 0x4000
>
>  #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
> - IFF_MULTI_QUEUE | IFF_NAPI)
> + IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
> +
>  #define GOODCOPY_LEN 128
>
>  #define FLT_EXACT_COUNT 8
> @@ -173,6 +175,7 @@ struct tun_file {
> unsigned int ifindex;
> };
> struct napi_struct napi;
> +   struct mutex napi_mutex;   /* Protects access to the above napi */
> struct list_head next;
> struct tun_struct *detached;
> struct skb_array tx_array;
> @@ -277,6 +280,7 @@ static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile,
> netif_napi_add(tun->dev, &tfile->napi, tun_napi_poll,
>NAPI_POLL_WEIGHT);
> napi_enable(&tfile->napi);
> +   mutex_init(&tfile->napi_mutex);
> }
>  }
>
> @@ -292,6 +296,11 @@ static void tun_napi_del(struct tun_struct *tun, struct tun_file *tfile)
> netif_napi_del(&tfile->napi);
>  }
>
> +static bool tun_napi_frags_enabled(const struct tun_struct *tun)
> +{
> +   return READ_ONCE(tun->flags) & IFF_NAPI_FRAGS;
> +}
> +
>  #ifdef CONFIG_TUN_VNET_CROSS_LE
>  static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
>  {
> @@ -1036,7 +1045,8 @@ static void tun_poll_controller(struct net_device *dev)
>  * supports polling, which enables bridge devices in virt setups to
>  * still use netconsole
>  * If NAPI is enabled, however, we need to schedule polling for all
> -* queues.
> +* queues unless we are using napi_gro_frags(), which we call in
> +* process context and not in NAPI context.
>  */
> struct tun_struct *tun = netdev_priv(dev);
>
> @@ -1044,6 +1054,9 @@ static void tun_poll_controller(struct net_device *dev)
> struct tun_file *tfile;
> int i;
>
> +   if (tun_napi_frags_enabled(tun))
> +   return;
> +
> rcu_read_lock();
> for (i = 0; i < tun->numqueues; i++) {
> tfile = rcu_dereference(tun->tfiles[i]);
> @@ -1266,6 +1279,64 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
> return mask;
>  }
>
> +static struct sk_buff *tun_napi_alloc_frags(struct tun_file *tfile,
> +   size_t len,
> +   const struct iov_iter *it)
> +{
> +   struct sk_buff *skb;
> +   size_t linear;
> +   int err;
> +   int i;
> +
> +   if (it->nr_segs > MAX_SKB_FRAGS + 1)
> +   return ERR_PTR(-ENOMEM);
> +
> +   local_bh_disable();
> +   skb = napi_get_frags(&tfile->napi);
> +   local_bh_enable();
> +   if (!skb)
> +   return ERR_PTR(-ENOMEM);
> +
> +   linear = iov_iter_single_seg_count(it);
> +   err = __skb_grow(skb, linear);
> +   if (err)
> +   goto free;
> +
> +   skb->len = len;
> +   skb->data_len = len - linear;
> +   skb->truesize += skb->data_len;
> +
> +   for (i = 1; i < it->
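
For readers unfamiliar with the TUN/TAP userspace side, here is a minimal
sketch of how this mode might be exercised, based purely on the description
above: open a TAP fd with IFF_NAPI and IFF_NAPI_FRAGS set, then inject one
frame with writev() so the first iovec becomes the linear data and the rest
become fragments. The interface name is arbitrary, the new flags are assumed
to be available in <linux/if_tun.h> once this series is applied, and
CAP_NET_ADMIN is required; this is not code from the patch.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(void)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);

	if (fd < 0) {
		perror("open(/dev/net/tun)");
		return 1;
	}

	memset(&ifr, 0, sizeof(ifr));
	snprintf(ifr.ifr_name, IFNAMSIZ, "tap-frags0");	/* arbitrary name */
	/* IFF_NAPI_FRAGS is only accepted together with IFF_TAP and IFF_NAPI,
	 * and needs CAP_NET_ADMIN (per the commit message above).
	 */
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_NAPI | IFF_NAPI_FRAGS;
	if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
		perror("TUNSETIFF");
		return 1;
	}

	/* Inject one frame: first iovec = linear data (Ethernet header),
	 * each following iovec becomes a fragment of the skb. The device
	 * may also need to be brought up for the packet to be processed.
	 */
	unsigned char eth[14] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,	/* dst */
				  0x02, 0x00, 0x00, 0x00, 0x00, 0x01,	/* src */
				  0x08, 0x00 };				/* type */
	unsigned char payload[256] = { 0 };
	struct iovec iov[2] = {
		{ .iov_base = eth,     .iov_len = sizeof(eth) },
		{ .iov_base = payload, .iov_len = sizeof(payload) },
	};
	if (writev(fd, iov, 2) < 0)
		perror("writev");

	close(fd);
	return 0;
}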

Re: [PATCH,v3,net-next 1/2] tun: enable NAPI for TUN/TAP driver

2017-09-25 Thread महेश बंडेवार
On Fri, Sep 22, 2017 at 1:49 PM, Petar Penkov  wrote:
> Changes TUN driver to use napi_gro_receive() upon receiving packets
> rather than netif_rx_ni(). Adds flag IFF_NAPI that enables these
> changes and operation is not affected if the flag is disabled.  SKBs
> are constructed upon packet arrival and are queued to be processed
> later.
>
> The new path was evaluated with a benchmark with the following setup:
> Open two tap devices and a receiver thread that reads in a loop for
> each device. Start one sender thread and pin all threads to different
> CPUs. Send 1M minimum UDP packets to each device and measure sending
> time for each of the sending methods:
> napi_gro_receive(): 4.90s
> netif_rx_ni():  4.90s
> netif_receive_skb():   7.20s
>
> Signed-off-by: Petar Penkov 

Thank you Petar.

Acked-by: Mahesh Bandewar 

> Cc: Eric Dumazet 
> Cc: Mahesh Bandewar 
> Cc: Willem de Bruijn 
> Cc: da...@davemloft.net
> Cc: ppen...@stanford.edu
> ---
>  drivers/net/tun.c   | 133 +++-
>  include/uapi/linux/if_tun.h |   1 +
>  2 files changed, 119 insertions(+), 15 deletions(-)
>
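
Not the authors' benchmark, but a rough sketch of one receiver from the setup
described above: attach to an existing persistent TAP device (created
beforehand, e.g. with "ip tuntap add"), pin the thread to a CPU, and read
frames in a loop. Device name, flags, and counts here are illustrative
assumptions only.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(int argc, char **argv)
{
	struct ifreq ifr;
	cpu_set_t set;
	char buf[2048];
	long packets = 0;
	int fd, cpu;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <tap-device> <cpu>\n", argv[0]);
		return 1;
	}
	cpu = atoi(argv[2]);

	fd = open("/dev/net/tun", O_RDWR);
	if (fd < 0) {
		perror("open(/dev/net/tun)");
		return 1;
	}
	memset(&ifr, 0, sizeof(ifr));
	snprintf(ifr.ifr_name, IFNAMSIZ, "%s", argv[1]);
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;	/* must match how the device was created */
	if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
		perror("TUNSETIFF");
		return 1;
	}

	/* Pin this receiver to the requested CPU, as in the description above. */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	while (packets < 1000000) {	/* 1M packets, matching the description */
		if (read(fd, buf, sizeof(buf)) < 0) {
			perror("read");
			break;
		}
		packets++;
	}
	printf("received %ld packets on %s (cpu %d)\n", packets, argv[1], cpu);
	return 0;
}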


Re: [PATCH,v2,net-next 1/2] tun: enable NAPI for TUN/TAP driver

2017-09-22 Thread महेश बंडेवार
On Fri, Sep 22, 2017 at 11:03 AM, Willem de Bruijn
 wrote:
> On Fri, Sep 22, 2017 at 1:11 PM, Mahesh Bandewar (महेश बंडेवार)
>  wrote:
>>>  #ifdef CONFIG_TUN_VNET_CROSS_LE
>>>  static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
>>>  {
>>> @@ -541,6 +604,11 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>>>
>>> tun = rtnl_dereference(tfile->tun);
>>>
>>> +   if (tun && clean) {
>>> +   tun_napi_disable(tun, tfile);
>> are we missing synchronize_net() separating disable and del calls?
>
> That is not needed here. napi_disable has its own mechanism for waiting
> until a napi struct is no longer run. netif_napi_del will call synchronize_net
> if needed.
Yes, that will do. Thanks.

> These two calls are made one after the other in quite a few drivers.


Re: [PATCH,v2,net-next 1/2] tun: enable NAPI for TUN/TAP driver

2017-09-22 Thread महेश बंडेवार
>  #ifdef CONFIG_TUN_VNET_CROSS_LE
>  static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
>  {
> @@ -541,6 +604,11 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>
> tun = rtnl_dereference(tfile->tun);
>
> +   if (tun && clean) {
> +   tun_napi_disable(tun, tfile);
are we missing synchronize_net() separating disable and del calls?
> +   tun_napi_del(tun, tfile);
> +   }
> +
> if (tun && !tfile->detached) {
> u16 index = tfile->queue_index;
> BUG_ON(index >= tun->numqueues);


Re: [PATCH,v2,net-next 2/2] tun: enable napi_gro_frags() for TUN/TAP driver

2017-09-22 Thread महेश बंडेवार
On Fri, Sep 22, 2017 at 7:06 AM, Willem de Bruijn
 wrote:
>> @@ -2061,6 +2174,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>> if (tfile->detached)
>> return -EINVAL;
>>
>> +   if ((ifr->ifr_flags & IFF_NAPI_FRAGS) && !capable(CAP_NET_ADMIN))
>> +   return -EPERM;
>> +
>
> This should perhaps be moved into the !dev branch, directly below the
> ns_capable check.
>
Hmm, does that mean we fail only on creation but allow attaching if the
device already exists? That would be wrong, wouldn't it? Correct me if I'm
wrong, but we want to prevent both scenarios if the user does not have
sufficient privileges (i.e. CAP_NET_ADMIN in the init netns).

>> dev = __dev_get_by_name(net, ifr->ifr_name);
>> if (dev) {
>> if (ifr->ifr_flags & IFF_TUN_EXCL)
>> @@ -2185,6 +2301,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>> tun->flags = (tun->flags & ~TUN_FEATURES) |
>> (ifr->ifr_flags & TUN_FEATURES);
>>
>> +   if (!(tun->flags & IFF_NAPI) || (tun->flags & TUN_TYPE_MASK) != IFF_TAP)
>> +   tun->flags = tun->flags & ~IFF_NAPI_FRAGS;
>> +
>
> Similarly, this check only need to be performed in that branch.
> Instead of reverting to non-frags mode, a tun_set_iff with the wrong
> set of flags should probably fail hard.
Yes, I agree; a wrong set of flags should fail hard, and that check should
probably happen before attach or open, no?


Re: [PATCH net] net: bonding: fix tlb_dynamic_lb default value

2017-09-12 Thread महेश बंडेवार
On Tue, Sep 12, 2017 at 5:10 AM, Nikolay Aleksandrov
 wrote:
> Commit 8b426dc54cf4 ("bonding: remove hardcoded value") changed the
> default value for tlb_dynamic_lb which lead to either broken ALB mode
> (since tlb_dynamic_lb can be changed only in TLB) or setting TLB mode
> with tlb_dynamic_lb equal to 0.
> The first issue was recently fixed by setting tlb_dynamic_lb to 1 always
> when switching to ALB mode, but the default value is still wrong and
> we'll enter TLB mode with tlb_dynamic_lb equal to 0 if the mode is
> changed via netlink or sysfs. In order to restore the previous behaviour
> and default value simply remove the mode check around the default param
> initialization for tlb_dynamic_lb which will always set it to 1 as
> before.
>
> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
> Signed-off-by: Nikolay Aleksandrov 
Acked-by: Mahesh Bandewar 
> ---
>  drivers/net/bonding/bond_main.c | 17 +++--
>  1 file changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index fc63992ab0e0..c99dc59d729b 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -4289,7 +4289,7 @@ static int bond_check_params(struct bond_params *params)
> int bond_mode   = BOND_MODE_ROUNDROBIN;
> int xmit_hashtype = BOND_XMIT_POLICY_LAYER2;
> int lacp_fast = 0;
> -   int tlb_dynamic_lb = 0;
> +   int tlb_dynamic_lb;
>
> /* Convert string parameters. */
> if (mode) {
> @@ -4601,16 +4601,13 @@ static int bond_check_params(struct bond_params *params)
> }
> ad_user_port_key = valptr->value;
>
> -   if ((bond_mode == BOND_MODE_TLB) || (bond_mode == BOND_MODE_ALB)) {
> -   bond_opt_initstr(&newval, "default");
> -   valptr = bond_opt_parse(bond_opt_get(BOND_OPT_TLB_DYNAMIC_LB),
> -   &newval);
> -   if (!valptr) {
> -   pr_err("Error: No tlb_dynamic_lb default value");
> -   return -EINVAL;
> -   }
> -   tlb_dynamic_lb = valptr->value;
> +   bond_opt_initstr(&newval, "default");
> +   valptr = bond_opt_parse(bond_opt_get(BOND_OPT_TLB_DYNAMIC_LB), &newval);
> +   if (!valptr) {
> +   pr_err("Error: No tlb_dynamic_lb default value");
> +   return -EINVAL;
> }
> +   tlb_dynamic_lb = valptr->value;
>
> if (lp_interval == 0) {
> pr_warn("Warning: ip_interval must be between 1 and %d, so it 
> was reset to %d\n",
> --
> 2.1.4
>
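
For reference, the wrong default described above can reportedly be observed
from userspace through the bonding sysfs attributes. Below is a small sketch
that switches an existing bond to balance-tlb and reads back tlb_dynamic_lb;
it assumes a bond named bond0 that is down and has no slaves (the mode can
only be changed then), and the usual /sys/class/net/<bond>/bonding/ layout.
None of this is taken from the patch itself.

#include <stdio.h>

/* Write a string value to a sysfs attribute; returns 0 on success. */
static int write_attr(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char buf[32] = "";
	FILE *f;

	if (write_attr("/sys/class/net/bond0/bonding/mode", "balance-tlb"))
		perror("set mode");

	f = fopen("/sys/class/net/bond0/bonding/tlb_dynamic_lb", "r");
	if (!f) {
		perror("read tlb_dynamic_lb");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("tlb_dynamic_lb after switching to balance-tlb: %s", buf);
	fclose(f);
	return 0;
}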


Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-11 Thread महेश बंडेवार
On Sat, Sep 9, 2017 at 4:28 AM, Nikolay Aleksandrov
 wrote:
> On 07/09/17 01:47, Kosuke Tatsukawa wrote:
>> Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
>> balance-alb mode") tried to fix transmit dynamic load balancing in
>> balance-alb mode, which wasn't working after commit 8b426dc54cf4
>> ("bonding: remove hardcoded value").
>>
>> It turned out that my previous patch only fixed the case when
>> balance-alb was specified as bonding module parameter, and not when
>> balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
>> common usage).  In the latter case, tlb_dynamic_lb was set up according
>> to the default mode of the bonding interface, which happens to be
>> balance-rr.
>>
>> This additional patch addresses this issue by setting up tlb_dynamic_lb
>> to 1 if "mode" is set to balance-alb through the sysfs interface.
>>
>> I didn't add code to change tlb_balance_lb back to the default value for
>> other modes, because "mode" is usually set up only once during
>> initialization, and it's not worthwhile to change the static variable
>> bonding_defaults in bond_main.c to a global variable just for this
>> purpose.
>>
>> Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
>> balance-tlb mode if it is set up using the sysfs interface.  I didn't
>> change that behavior, because the value of tlb_balance_lb can be changed
>> using the sysfs interface for balance-tlb, and I didn't like changing
>> the default value back and forth for balance-tlb.
>>
>> As for balance-alb, /sys/class/net/*/bonding/tlb_balance_lb cannot be
>> written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
>> is not an intended usage, so there is little use making it writable at
>> this moment.
>>
>> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
>> Reported-by: Reinis Rozitis 
>> Signed-off-by: Kosuke Tatsukawa 
>> Cc: sta...@vger.kernel.org  # v4.12+
>> ---
>>  drivers/net/bonding/bond_options.c |3 +++
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>
>
> This fix is simpler and more suitable for -net, it fixes the case where
> we switch to ALB mode with tlb_dynamic_lb = 0. After it is in I'll fix the
> default tlb_dynamic_lb issue and restore the original behaviour.
>
Changing tlb_dynamic_lb so that it is always initialized is also safe for
-net and can go in before or after this change (there is no dependency on
this change as such).

> Acked-by: Nikolay Aleksandrov 
Acked-by: Mahesh Bandewar 
>
>


Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-08 Thread महेश बंडेवार
On Fri, Sep 8, 2017 at 7:30 AM, Nikolay Aleksandrov
 wrote:
> On 08/09/17 17:17, Kosuke Tatsukawa wrote:
>> Hi,
>>
>>> On 08/09/17 13:10, Nikolay Aleksandrov wrote:
 On 08/09/17 05:06, Kosuke Tatsukawa wrote:
> Hi,
>
>> On  7.09.2017 01:47, Kosuke Tatsukawa wrote:
>>> Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
>>> balance-alb mode") tried to fix transmit dynamic load balancing in
>>> balance-alb mode, which wasn't working after commit 8b426dc54cf4
>>> ("bonding: remove hardcoded value").
>>>
>>> It turned out that my previous patch only fixed the case when
>>> balance-alb was specified as bonding module parameter, and not when
>>> balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
>>> common usage).  In the latter case, tlb_dynamic_lb was set up according
>>> to the default mode of the bonding interface, which happens to be
>>> balance-rr.
>>>
>>> This additional patch addresses this issue by setting up tlb_dynamic_lb
>>> to 1 if "mode" is set to balance-alb through the sysfs interface.
>>>
>>> I didn't add code to change tlb_balance_lb back to the default value for
>>> other modes, because "mode" is usually set up only once during
>>> initialization, and it's not worthwhile to change the static variable
>>> bonding_defaults in bond_main.c to a global variable just for this
>>> purpose.
>>>
>>> Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
>>> balance-tlb mode if it is set up using the sysfs interface.  I didn't
>>> change that behavior, because the value of tlb_balance_lb can be changed
>>> using the sysfs interface for balance-tlb, and I didn't like changing
>>> the default value back and forth for balance-tlb.
>>>
>>> As for balance-alb, /sys/class/net/*/bonding/tlb_balance_lb cannot be
>>> written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
>>> is not an intended usage, so there is little use making it writable at
>>> this moment.
>>>
>>> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
>>> Reported-by: Reinis Rozitis 
>>> Signed-off-by: Kosuke Tatsukawa 
>>> Cc: sta...@vger.kernel.org  # v4.12+
>>> ---
>>>  drivers/net/bonding/bond_options.c |3 +++
>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>
>>
>> I don't believe this to be the right solution, hardcoding it like this
>> changes user-visible behaviour. The issue is that if someone configured
>> it to be 0 in tlb mode, suddenly it will become 1 and will silently
>> override their config if they switch the mode to alb. Also it robs users
>> from their choice.
>>
>> If you think this should be settable in ALB mode, then IMO you should
>> edit tlb_dynamic_lb option's unsuppmodes and allow it to be set in ALB.
>> That would also be consistent with how it's handled in TLB mode.
>
> No, I don't think tlb_dynamic_lb should be settable in balance-alb at
> this point.  All the current commits regarding tlb_dynamic_lb are for
> balance-tlb mode, so I don't think balance-alb with tlb_dynamic_lb set
> to 0 is an intended usage.
>
>
>> Going back and looking at your previous fix I'd argue that it is also
>> wrong, you should've removed the mode check altogether to return the
>> original behaviour where the dynamic_lb is set to 1 (enabled) by
>> default and then ALB mode would've had it, of course that would've left
>> the case of setting it to 0 in TLB mode and switching to ALB, but that
>> is a different issue.
>
> Maybe balance-alb shouldn't be dependent on tlb_dynamic_lb.
> tlb_dynamic_lb is referenced in the following functions.
>
>  + bond_do_alb_xmit()  -- Used by both balance-tlb and balance-alb
>  + bond_tlb_xmit()  -- Only used by balance-tlb
>  + bond_open()  -- Used by both balance-tlb and balance-alb
>  + bond_check_params()  -- Used during module initialization
>  + bond_fill_info()  -- Used to get/set value
>  + bond_option_tlb_dynamic_lb_set()  -- Used to get/set value
>  + bonding_show_tlb_dynamic_lb()  -- Used to get/set value
>  + bond_is_nondyn_tlb()  -- Only referenced if balance-tlb mode
>
> The following untested patch adds code to make balance-alb work as if
> tlb_dynamic_lb==1 for the functions which affect balance-alb mode.  It
> also reverts my previous patch.
>
> What do you think about this approach?
> ---
> Kosuke TATSUKAWA  | 1st Platform Software Division
>   | NEC Solution Innovators
>   | ta...@ab.jp.nec.com
>

 Logically the approach looks good, that being said it adds unnecessary 
 tests in
 the fast path, why not just something like the patch below ? That only 
 leaves the
 problem if it is zeroed in TLB and switched 

Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-07 Thread महेश बंडेवार
On Thu, Sep 7, 2017 at 5:47 PM, Mahesh Bandewar (महेश बंडेवार)
 wrote:
> On Thu, Sep 7, 2017 at 5:39 PM, Mahesh Bandewar (महेश बंडेवार)
>  wrote:
>> On Thu, Sep 7, 2017 at 4:09 PM, Nikolay Aleksandrov
>>  wrote:
>>> On  7.09.2017 01:47, Kosuke Tatsukawa wrote:
>>>> Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
>>>> balance-alb mode") tried to fix transmit dynamic load balancing in
>>>> balance-alb mode, which wasn't working after commit 8b426dc54cf4
>>>> ("bonding: remove hardcoded value").
>>>>
>>>> It turned out that my previous patch only fixed the case when
>>>> balance-alb was specified as bonding module parameter, and not when
>>>> balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
>>>> common usage).  In the latter case, tlb_dynamic_lb was set up according
>>>> to the default mode of the bonding interface, which happens to be
>>>> balance-rr.
>>>>
>>>> This additional patch addresses this issue by setting up tlb_dynamic_lb
>>>> to 1 if "mode" is set to balance-alb through the sysfs interface.
>>>>
>>>> I didn't add code to change tlb_balance_lb back to the default value for
>>>> other modes, because "mode" is usually set up only once during
>>>> initialization, and it's not worthwhile to change the static variable
>>>> bonding_defaults in bond_main.c to a global variable just for this
>>>> purpose.
>>>>
>>>> Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
>>>> balance-tlb mode if it is set up using the sysfs interface.  I didn't
>>>> change that behavior, because the value of tlb_balance_lb can be changed
>>>> using the sysfs interface for balance-tlb, and I didn't like changing
>>>> the default value back and forth for balance-tlb.
>>>>
>>>> As for balance-alb, /sys/class/net/*/bonding/tlb_balance_lb cannot be
>>>> written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
>>>> is not an intended usage, so there is little use making it writable at
>>>> this moment.
>>>>
>>>> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
>>
>> This is wrong! I think you are confusing various things here. ALB is
>> a different mode from TLB, and tlb_dynamic_lb is *only* a special case of
>> TLB. Your earlier patch is also wrong for the same reasons. However,
>> since the default value of tlb_dynamic_lb is set to 0, it's not
>> causing issues for ALB mode; otherwise it would break ALB mode.
> I take this back. The default value is 1 so ALB is broken because of
> the referenced change.
>
>> tlb_dynamic_lb has absolutely nothing to do with ALB mode. Please
>> revert the earlier change (cbf5ecb30560).
>>
>> It's not clear to me what you saw as broken, so I can't really suggest
>> what really needs to be done.
Please scratch all I said.


  1   2   >