Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-09 Thread Cong Wang
On Tue, Mar 8, 2016 at 10:42 AM, Mahesh Bandewar  wrote:
>> The more subsystems involves, the more struct net pointers you
>> potentially need to touch, the less likely you can make it correct
>> by just switching skb->dev.
>
> Please drop that prejudice and read the patch-set carefully. I'm
> neither changing
> any *net* pointers nor changing the skb->dev pointers anywhere. All I'm saying
> is dev->l3_dev is what we'll use for *all* L3 processing and no need to change
> skb->dev or net-ns of any device(s) involved.

Please don't misinterpret me.

I said _switching_, not overwriting or changing, you use dev->l3_dev to _switch_
skb->dev and/or net, this is exactly what I am complaining about. This is not
how netns works.


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-08 Thread Mahesh Bandewar
On Mon, Mar 7, 2016 at 9:37 PM, Cong Wang  wrote:
> On Fri, Mar 4, 2016 at 2:12 PM, Mahesh Bandewar  wrote:
>>
>> Unfortunately we don't have a way to switch to ns after L3 processing.
>
> I am totally aware of this, this is exactly why I said this might not be easy.
>
> The question is how hard it is to implement one? My gut feeling is we only
> need to change some data in skb, something similar to skb_scrub_packet().
> But I never even try.
>
>> So I would
>> argue it the other-way around. The way it is now; breaks the _isolation_ 
>> model.
>> If the default-ns is responsible for whole L3 (in this situation) and
>> it does pretty well
>> on egress but there is no way to do that in ingress path. IPtables is
>> not the only thing,
>> how about routing, how about IPsec? None of this will function well.
>> So we need to
>> have a generic solution to solve all these problems.
>
> I don't understand why you question me this, it is you who only cares
> about iptables from your cover letter for this patchset, not me.
>
calm down! I'm not questioning you or anyone. There is problem and I would
prefer to solve it properly without cooking-up some short-term hack. The problem
reported is about IPtable and when I looked at it, I felt that it's
just a matter of
time until someone tries / uses IPsec or some routing. So I created
this solution.
I did mention about *complete* L3 processing in the cover letter but
agree that I
emphasized only on IPtables which I can correct it in the next version.

> The more subsystems involves, the more struct net pointers you
> potentially need to touch, the less likely you can make it correct
> by just switching skb->dev.

Please drop that prejudice and read the patch-set carefully. I'm
neither changing
any *net* pointers nor changing the skb->dev pointers anywhere. All I'm saying
is dev->l3_dev is what we'll use for *all* L3 processing and no need to change
skb->dev or net-ns of any device(s) involved.


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-08 Thread Nicolas Dichtel

+ Eric B.

Le 08/03/2016 06:37, Cong Wang a écrit :

On Fri, Mar 4, 2016 at 2:12 PM, Mahesh Bandewar  wrote:


Unfortunately we don't have a way to switch to ns after L3 processing.


I am totally aware of this, this is exactly why I said this might not be easy.

The question is how hard it is to implement one? My gut feeling is we only
need to change some data in skb, something similar to skb_scrub_packet().
But I never even try.


Note that Eric Biederman has made "some" patches to be able to "control" the
netns on the egress side. The goal is to be able to have routes that leak to
another netns.
It seems that the same work should probably be done at the ingress side.

I agree with Cong that just changing skb->dev is not the right approach.


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-07 Thread Cong Wang
On Fri, Mar 4, 2016 at 2:12 PM, Mahesh Bandewar  wrote:
>
> Unfortunately we don't have a way to switch to ns after L3 processing.

I am totally aware of this, this is exactly why I said this might not be easy.

The question is how hard it is to implement one? My gut feeling is we only
need to change some data in skb, something similar to skb_scrub_packet().
But I never even try.

> So I would
> argue it the other-way around. The way it is now; breaks the _isolation_ 
> model.
> If the default-ns is responsible for whole L3 (in this situation) and
> it does pretty well
> on egress but there is no way to do that in ingress path. IPtables is
> not the only thing,
> how about routing, how about IPsec? None of this will function well.
> So we need to
> have a generic solution to solve all these problems.

I don't understand why you question me this, it is you who only cares
about iptables from your cover letter for this patchset, not me.

The more subsystems involves, the more struct net pointers you
potentially need to touch, the less likely you can make it correct
by just switching skb->dev.


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-04 Thread Mahesh Bandewar
On Fri, Mar 4, 2016 at 9:30 AM, Cong Wang  wrote:
> On Thu, Mar 3, 2016 at 5:42 PM, Mahesh Bandewar  wrote:
> As you mentioned logically we should be able to pass the skb in master's
> ns
> until L3 processing is completed. This patch series attempts to do that by
> disassociating this logic from skb->dev and adding it to l3_dev. This
> should
> include not just IPT but all that is done in L3 phase (IPT, routing etc.)
> Also since dev->l3_dev is same as dev, this should not break any existing
> logic.
>
 Well, looking at the code I realized that I missed few places (especially
 routing
 logic) which continues using skb->dev in ingress path and should be
 corrected to
 use l3_dev. I'll correct those places and send the next version.
>>>
>>>
>>> Look, even you yourself are missing something here. ;) This is exactly why
>>> I suggest to consider another approach. Please don't introduce any code
>>> that is hard to debug even for yourself. The struct net pointer is passed
>>> around in kernel network subsystem almost everywhere, it is not easy to make
>>> it bug-free by just switching skb->dev.
>>>
>> I disagree. Conceptually this is very easy to understand as we are taking L3
>> processing off of skb->dev and loading it onto dev->l3_dev. So
>> everything that is
>> associated with l3_dev is for L3. This should neither make debugging harder
>> nor add complicated code.
>
> Nope, conceptually it is not right, that breaks _isolation_ in
> concept, we need to
> make each skb really traverse in the original stack, not just
> switching skb->dev,
> that is cheating. It is just too easy to hide bugs in your way, we
> never use network
> namespace in this way before.

Unfortunately we don't have a way to switch to ns after L3 processing.
So I would
argue it the other-way around. The way it is now; breaks the _isolation_ model.
If the default-ns is responsible for whole L3 (in this situation) and
it does pretty well
on egress but there is no way to do that in ingress path. IPtables is
not the only thing,
how about routing, how about IPsec? None of this will function well.
So we need to
have a generic solution to solve all these problems.


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-04 Thread Cong Wang
On Thu, Mar 3, 2016 at 5:42 PM, Mahesh Bandewar  wrote:
 As you mentioned logically we should be able to pass the skb in master's
 ns
 until L3 processing is completed. This patch series attempts to do that by
 disassociating this logic from skb->dev and adding it to l3_dev. This
 should
 include not just IPT but all that is done in L3 phase (IPT, routing etc.)
 Also since dev->l3_dev is same as dev, this should not break any existing
 logic.

>>> Well, looking at the code I realized that I missed few places (especially
>>> routing
>>> logic) which continues using skb->dev in ingress path and should be
>>> corrected to
>>> use l3_dev. I'll correct those places and send the next version.
>>
>>
>> Look, even you yourself are missing something here. ;) This is exactly why
>> I suggest to consider another approach. Please don't introduce any code
>> that is hard to debug even for yourself. The struct net pointer is passed
>> around in kernel network subsystem almost everywhere, it is not easy to make
>> it bug-free by just switching skb->dev.
>>
> I disagree. Conceptually this is very easy to understand as we are taking L3
> processing off of skb->dev and loading it onto dev->l3_dev. So
> everything that is
> associated with l3_dev is for L3. This should neither make debugging harder
> nor add complicated code.

Nope, conceptually it is not right, that breaks _isolation_ in
concept, we need to
make each skb really traverse in the original stack, not just
switching skb->dev,
that is cheating. It is just too easy to hide bugs in your way, we
never use network
namespace in this way before.


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-03 Thread Mahesh Bandewar
>>> As you mentioned logically we should be able to pass the skb in master's
>>> ns
>>> until L3 processing is completed. This patch series attempts to do that by
>>> disassociating this logic from skb->dev and adding it to l3_dev. This
>>> should
>>> include not just IPT but all that is done in L3 phase (IPT, routing etc.)
>>> Also since dev->l3_dev is same as dev, this should not break any existing
>>> logic.
>>>
>> Well, looking at the code I realized that I missed few places (especially
>> routing
>> logic) which continues using skb->dev in ingress path and should be
>> corrected to
>> use l3_dev. I'll correct those places and send the next version.
>
>
> Look, even you yourself are missing something here. ;) This is exactly why
> I suggest to consider another approach. Please don't introduce any code
> that is hard to debug even for yourself. The struct net pointer is passed
> around in kernel network subsystem almost everywhere, it is not easy to make
> it bug-free by just switching skb->dev.
>
I disagree. Conceptually this is very easy to understand as we are taking L3
processing off of skb->dev and loading it onto dev->l3_dev. So
everything that is
associated with l3_dev is for L3. This should neither make debugging harder
nor add complicated code.

>
>>> That's the generic implementation as far as the stack is concerned and
>>> IPvlan
>>> uses it to make the IPT hooks symmetric.
>>>
>>> Another IPT hook may be good enough (however I haven't
>>> given much thought to it) for IPvlan, but this generic approach will be
>>> for
>>> whole of L3. Also currently this I have implemented for the ingress path
>>> but that does not mean the same cannot be extended for the egress path
>>> (in fact I'm thinking about that)
>>
>
> This is logically correct and easier to understand or debug, since IPT hook
> is very common in network subsystem even ingress qdisc uses it.


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-03 Thread Cong Wang
On Thu, Mar 3, 2016 at 3:21 PM, Mahesh Bandewar  wrote:
>
>
> On Thu, Mar 3, 2016 at 1:43 PM, Mahesh Bandewar  wrote:
>>
>> On Wed, Mar 2, 2016 at 8:45 PM, Cong Wang 
>> wrote:
>> >
>> > On Mon, Feb 29, 2016 at 2:08 PM, Mahesh Bandewar 
>> > wrote:
>> > > From: Mahesh Bandewar 
>> > >
>> > > netif_receive_skb_core() dispatcher uses skb->dev device to send it
>> > > to the packet-handlers (e.g. ip_rcv, ipv6_rcv etc). These packet
>> > > handlers intern use the device passed to determine the net-ns to
>> > > further process these packets.  Now with the nomination logic, the
>> > > dispatcher will call netif_get_l3_dev() helper to select the device
>> > > to be used for this processing. Since l3_dev is initialized to self,
>> > > normal packet processing should not change.
>> > >
>> >
>> > So, if I understand your patches correctly, _logically_ the skb is still
>> > passed into the slave's netns via dev_forward_skb() but now goes over
>> > the iptable rules from the default netns by only changing the netns
>> > parameter to these hooks?
>> >
>> We are using different dev pointer for L3 processing than skb->dev. All
>> netns, routing etc, associated with this dev (l3_dev) should be used for
>> L3.
>>
>> > That is ugly... Logically, you should still need to continue to pass
>> > the skb upper to the stack in default netns until
>> > ip_local_deliver_finish().
>> >
>> >
>> > So, how about adding an iptable hook in ipvlan so that skb will
>> > continue traverse in the original stack and then moved into slave's
>> > netns? This might be harder since logically we need an L3 entrance
>> > to the stack.
>> >
>> > Thoughts?
>>
>> As you mentioned logically we should be able to pass the skb in master's
>> ns
>> until L3 processing is completed. This patch series attempts to do that by
>> disassociating this logic from skb->dev and adding it to l3_dev. This
>> should
>> include not just IPT but all that is done in L3 phase (IPT, routing etc.)
>> Also since dev->l3_dev is same as dev, this should not break any existing
>> logic.
>>
> Well, looking at the code I realized that I missed few places (especially
> routing
> logic) which continues using skb->dev in ingress path and should be
> corrected to
> use l3_dev. I'll correct those places and send the next version.


Look, even you yourself are missing something here. ;) This is exactly why
I suggest to consider another approach. Please don't introduce any code
that is hard to debug even for yourself. The struct net pointer is passed
around in kernel network subsystem almost everywhere, it is not easy to make
it bug-free by just switching skb->dev.


>> That's the generic implementation as far as the stack is concerned and
>> IPvlan
>> uses it to make the IPT hooks symmetric.
>>
>> Another IPT hook may be good enough (however I haven't
>> given much thought to it) for IPvlan, but this generic approach will be
>> for
>> whole of L3. Also currently this I have implemented for the ingress path
>> but that does not mean the same cannot be extended for the egress path
>> (in fact I'm thinking about that)
>

This is logically correct and easier to understand or debug, since IPT hook
is very common in network subsystem even ingress qdisc uses it.


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-03 Thread Mahesh Bandewar
On Wed, Mar 2, 2016 at 8:45 PM, Cong Wang  wrote:
>
> On Mon, Feb 29, 2016 at 2:08 PM, Mahesh Bandewar  wrote:
> > From: Mahesh Bandewar 
> >
> > netif_receive_skb_core() dispatcher uses skb->dev device to send it
> > to the packet-handlers (e.g. ip_rcv, ipv6_rcv etc). These packet
> > handlers intern use the device passed to determine the net-ns to
> > further process these packets.  Now with the nomination logic, the
> > dispatcher will call netif_get_l3_dev() helper to select the device
> > to be used for this processing. Since l3_dev is initialized to self,
> > normal packet processing should not change.
> >
>
> So, if I understand your patches correctly, _logically_ the skb is still
> passed into the slave's netns via dev_forward_skb() but now goes over
> the iptable rules from the default netns by only changing the netns
> parameter to these hooks?
>
We are using different dev pointer for L3 processing than skb->dev. All
netns, routing etc, associated with this dev (l3_dev) should be used for L3.

> That is ugly... Logically, you should still need to continue to pass
> the skb upper to the stack in default netns until ip_local_deliver_finish().
>
>
> So, how about adding an iptable hook in ipvlan so that skb will
> continue traverse in the original stack and then moved into slave's
> netns? This might be harder since logically we need an L3 entrance
> to the stack.
>
> Thoughts?

As you mentioned logically we should be able to pass the skb in master's ns
until L3 processing is completed. This patch series attempts to do that by
disassociating this logic from skb->dev and adding it to l3_dev. This should
include not just IPT but all that is done in L3 phase (IPT, routing etc.)
Also since dev->l3_dev is same as dev, this should not break any existing logic.

That's the generic implementation as far as the stack is concerned and IPvlan
uses it to make the IPT hooks symmetric.

Another IPT hook may be good enough (however I haven't
given much thought to it) for IPvlan, but this generic approach will be for
whole of L3. Also currently this I have implemented for the ingress path
but that does not mean the same cannot be extended for the egress path
(in fact I'm thinking about that)


Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-02 Thread Cong Wang
On Mon, Feb 29, 2016 at 2:08 PM, Mahesh Bandewar  wrote:
> From: Mahesh Bandewar 
>
> netif_receive_skb_core() dispatcher uses skb->dev device to send it
> to the packet-handlers (e.g. ip_rcv, ipv6_rcv etc). These packet
> handlers intern use the device passed to determine the net-ns to
> further process these packets.  Now with the nomination logic, the
> dispatcher will call netif_get_l3_dev() helper to select the device
> to be used for this processing. Since l3_dev is initialized to self,
> normal packet processing should not change.
>

So, if I understand your patches correctly, _logically_ the skb is still
passed into the slave's netns via dev_forward_skb() but now goes over
the iptable rules from the default netns by only changing the netns
parameter to these hooks?

That is ugly... Logically, you should still need to continue to pass
the skb upper to the stack in default netns until ip_local_deliver_finish().

So, how about adding an iptable hook in ipvlan so that skb will
continue traverse in the original stack and then moved into slave's
netns? This might be harder since logically we need an L3 entrance
to the stack.

Thoughts?


[PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-02-29 Thread Mahesh Bandewar
From: Mahesh Bandewar 

netif_receive_skb_core() dispatcher uses skb->dev device to send it
to the packet-handlers (e.g. ip_rcv, ipv6_rcv etc). These packet
handlers intern use the device passed to determine the net-ns to
further process these packets.  Now with the nomination logic, the
dispatcher will call netif_get_l3_dev() helper to select the device
to be used for this processing. Since l3_dev is initialized to self,
normal packet processing should not change.

Signed-off-by: Mahesh Bandewar 
CC: Eric Dumazet 
CC: Tim Hockin 
CC: Alex Pollitt 
CC: Matthew Dupre 
---
 net/core/dev.c   | 9 ++---
 net/ipv4/ip_input.c  | 5 +++--
 net/ipv6/ip6_input.c | 5 +++--
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c4023a68cdc1..9252436ef11a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1811,7 +1811,8 @@ static inline int deliver_skb(struct sk_buff *skb,
if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
return -ENOMEM;
atomic_inc(&skb->users);
-   return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+   return pt_prev->func(skb, netif_get_l3_dev(skb->dev), pt_prev,
+orig_dev);
 }
 
 static inline void deliver_ptype_list_skb(struct sk_buff *skb,
@@ -1904,7 +1905,8 @@ again:
}
 out_unlock:
if (pt_prev)
-   pt_prev->func(skb2, skb->dev, pt_prev, skb->dev);
+   pt_prev->func(skb2, netif_get_l3_dev(skb->dev), pt_prev,
+ skb->dev);
rcu_read_unlock();
 }
 
@@ -4157,7 +4159,8 @@ ncls:
if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
goto drop;
else
-   ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+   ret = pt_prev->func(skb, netif_get_l3_dev(skb->dev),
+   pt_prev, orig_dev);
} else {
 drop:
if (!deliver_exact)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index e3d782746d9d..b47164e3e1c6 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -247,7 +247,8 @@ int ip_local_deliver(struct sk_buff *skb)
/*
 *  Reassemble IP fragments.
 */
-   struct net *net = dev_net(skb->dev);
+   struct net_device *dev = netif_get_l3_dev(skb->dev);
+   struct net *net = dev_net(dev);
 
if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
@@ -255,7 +256,7 @@ int ip_local_deliver(struct sk_buff *skb)
}
 
return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
-  net, NULL, skb, skb->dev, NULL,
+  net, NULL, skb, dev, NULL,
   ip_local_deliver_finish);
 }
 
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index c05c425c2389..88443ac06402 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -287,9 +287,10 @@ discard:
 
 int ip6_input(struct sk_buff *skb)
 {
+   struct net_device *dev = netif_get_l3_dev(skb->dev);
+
return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN,
-  dev_net(skb->dev), NULL, skb, skb->dev, NULL,
-  ip6_input_finish);
+  dev_net(dev), NULL, skb, dev, NULL, ip6_input_finish);
 }
 
 int ip6_mc_input(struct sk_buff *skb)
-- 
2.7.0.rc3.207.g0ac5344