Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Tue, Mar 8, 2016 at 10:42 AM, Mahesh Bandewar wrote: >> The more subsystems involves, the more struct net pointers you >> potentially need to touch, the less likely you can make it correct >> by just switching skb->dev. > > Please drop that prejudice and read the patch-set carefully. I'm > neither changing > any *net* pointers nor changing the skb->dev pointers anywhere. All I'm saying > is dev->l3_dev is what we'll use for *all* L3 processing and no need to change > skb->dev or net-ns of any device(s) involved. Please don't misinterpret me. I said _switching_, not overwriting or changing, you use dev->l3_dev to _switch_ skb->dev and/or net, this is exactly what I am complaining about. This is not how netns works.
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Mon, Mar 7, 2016 at 9:37 PM, Cong Wang wrote: > On Fri, Mar 4, 2016 at 2:12 PM, Mahesh Bandewar wrote: >> >> Unfortunately we don't have a way to switch to ns after L3 processing. > > I am totally aware of this, this is exactly why I said this might not be easy. > > The question is how hard it is to implement one? My gut feeling is we only > need to change some data in skb, something similar to skb_scrub_packet(). > But I never even try. > >> So I would >> argue it the other-way around. The way it is now; breaks the _isolation_ >> model. >> If the default-ns is responsible for whole L3 (in this situation) and >> it does pretty well >> on egress but there is no way to do that in ingress path. IPtables is >> not the only thing, >> how about routing, how about IPsec? None of this will function well. >> So we need to >> have a generic solution to solve all these problems. > > I don't understand why you question me this, it is you who only cares > about iptables from your cover letter for this patchset, not me. > calm down! I'm not questioning you or anyone. There is problem and I would prefer to solve it properly without cooking-up some short-term hack. The problem reported is about IPtable and when I looked at it, I felt that it's just a matter of time until someone tries / uses IPsec or some routing. So I created this solution. I did mention about *complete* L3 processing in the cover letter but agree that I emphasized only on IPtables which I can correct it in the next version. > The more subsystems involves, the more struct net pointers you > potentially need to touch, the less likely you can make it correct > by just switching skb->dev. Please drop that prejudice and read the patch-set carefully. I'm neither changing any *net* pointers nor changing the skb->dev pointers anywhere. All I'm saying is dev->l3_dev is what we'll use for *all* L3 processing and no need to change skb->dev or net-ns of any device(s) involved.
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
+ Eric B. Le 08/03/2016 06:37, Cong Wang a écrit : On Fri, Mar 4, 2016 at 2:12 PM, Mahesh Bandewar wrote: Unfortunately we don't have a way to switch to ns after L3 processing. I am totally aware of this, this is exactly why I said this might not be easy. The question is how hard it is to implement one? My gut feeling is we only need to change some data in skb, something similar to skb_scrub_packet(). But I never even try. Note that Eric Biederman has made "some" patches to be able to "control" the netns on the egress side. The goal is to be able to have routes that leak to another netns. It seems that the same work should probably be done at the ingress side. I agree with Cong that just changing skb->dev is not the right approach.
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Fri, Mar 4, 2016 at 2:12 PM, Mahesh Bandewar wrote: > > Unfortunately we don't have a way to switch to ns after L3 processing. I am totally aware of this, this is exactly why I said this might not be easy. The question is how hard it is to implement one? My gut feeling is we only need to change some data in skb, something similar to skb_scrub_packet(). But I never even try. > So I would > argue it the other-way around. The way it is now; breaks the _isolation_ > model. > If the default-ns is responsible for whole L3 (in this situation) and > it does pretty well > on egress but there is no way to do that in ingress path. IPtables is > not the only thing, > how about routing, how about IPsec? None of this will function well. > So we need to > have a generic solution to solve all these problems. I don't understand why you question me this, it is you who only cares about iptables from your cover letter for this patchset, not me. The more subsystems involves, the more struct net pointers you potentially need to touch, the less likely you can make it correct by just switching skb->dev.
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Fri, Mar 4, 2016 at 9:30 AM, Cong Wang wrote: > On Thu, Mar 3, 2016 at 5:42 PM, Mahesh Bandewar wrote: > As you mentioned logically we should be able to pass the skb in master's > ns > until L3 processing is completed. This patch series attempts to do that by > disassociating this logic from skb->dev and adding it to l3_dev. This > should > include not just IPT but all that is done in L3 phase (IPT, routing etc.) > Also since dev->l3_dev is same as dev, this should not break any existing > logic. > Well, looking at the code I realized that I missed few places (especially routing logic) which continues using skb->dev in ingress path and should be corrected to use l3_dev. I'll correct those places and send the next version. >>> >>> >>> Look, even you yourself are missing something here. ;) This is exactly why >>> I suggest to consider another approach. Please don't introduce any code >>> that is hard to debug even for yourself. The struct net pointer is passed >>> around in kernel network subsystem almost everywhere, it is not easy to make >>> it bug-free by just switching skb->dev. >>> >> I disagree. Conceptually this is very easy to understand as we are taking L3 >> processing off of skb->dev and loading it onto dev->l3_dev. So >> everything that is >> associated with l3_dev is for L3. This should neither make debugging harder >> nor add complicated code. > > Nope, conceptually it is not right, that breaks _isolation_ in > concept, we need to > make each skb really traverse in the original stack, not just > switching skb->dev, > that is cheating. It is just too easy to hide bugs in your way, we > never use network > namespace in this way before. Unfortunately we don't have a way to switch to ns after L3 processing. So I would argue it the other-way around. The way it is now; breaks the _isolation_ model. If the default-ns is responsible for whole L3 (in this situation) and it does pretty well on egress but there is no way to do that in ingress path. IPtables is not the only thing, how about routing, how about IPsec? None of this will function well. So we need to have a generic solution to solve all these problems.
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Thu, Mar 3, 2016 at 5:42 PM, Mahesh Bandewar wrote: As you mentioned logically we should be able to pass the skb in master's ns until L3 processing is completed. This patch series attempts to do that by disassociating this logic from skb->dev and adding it to l3_dev. This should include not just IPT but all that is done in L3 phase (IPT, routing etc.) Also since dev->l3_dev is same as dev, this should not break any existing logic. >>> Well, looking at the code I realized that I missed few places (especially >>> routing >>> logic) which continues using skb->dev in ingress path and should be >>> corrected to >>> use l3_dev. I'll correct those places and send the next version. >> >> >> Look, even you yourself are missing something here. ;) This is exactly why >> I suggest to consider another approach. Please don't introduce any code >> that is hard to debug even for yourself. The struct net pointer is passed >> around in kernel network subsystem almost everywhere, it is not easy to make >> it bug-free by just switching skb->dev. >> > I disagree. Conceptually this is very easy to understand as we are taking L3 > processing off of skb->dev and loading it onto dev->l3_dev. So > everything that is > associated with l3_dev is for L3. This should neither make debugging harder > nor add complicated code. Nope, conceptually it is not right, that breaks _isolation_ in concept, we need to make each skb really traverse in the original stack, not just switching skb->dev, that is cheating. It is just too easy to hide bugs in your way, we never use network namespace in this way before.
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
>>> As you mentioned logically we should be able to pass the skb in master's >>> ns >>> until L3 processing is completed. This patch series attempts to do that by >>> disassociating this logic from skb->dev and adding it to l3_dev. This >>> should >>> include not just IPT but all that is done in L3 phase (IPT, routing etc.) >>> Also since dev->l3_dev is same as dev, this should not break any existing >>> logic. >>> >> Well, looking at the code I realized that I missed few places (especially >> routing >> logic) which continues using skb->dev in ingress path and should be >> corrected to >> use l3_dev. I'll correct those places and send the next version. > > > Look, even you yourself are missing something here. ;) This is exactly why > I suggest to consider another approach. Please don't introduce any code > that is hard to debug even for yourself. The struct net pointer is passed > around in kernel network subsystem almost everywhere, it is not easy to make > it bug-free by just switching skb->dev. > I disagree. Conceptually this is very easy to understand as we are taking L3 processing off of skb->dev and loading it onto dev->l3_dev. So everything that is associated with l3_dev is for L3. This should neither make debugging harder nor add complicated code. > >>> That's the generic implementation as far as the stack is concerned and >>> IPvlan >>> uses it to make the IPT hooks symmetric. >>> >>> Another IPT hook may be good enough (however I haven't >>> given much thought to it) for IPvlan, but this generic approach will be >>> for >>> whole of L3. Also currently this I have implemented for the ingress path >>> but that does not mean the same cannot be extended for the egress path >>> (in fact I'm thinking about that) >> > > This is logically correct and easier to understand or debug, since IPT hook > is very common in network subsystem even ingress qdisc uses it.
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Thu, Mar 3, 2016 at 3:21 PM, Mahesh Bandewar wrote: > > > On Thu, Mar 3, 2016 at 1:43 PM, Mahesh Bandewar wrote: >> >> On Wed, Mar 2, 2016 at 8:45 PM, Cong Wang >> wrote: >> > >> > On Mon, Feb 29, 2016 at 2:08 PM, Mahesh Bandewar >> > wrote: >> > > From: Mahesh Bandewar >> > > >> > > netif_receive_skb_core() dispatcher uses skb->dev device to send it >> > > to the packet-handlers (e.g. ip_rcv, ipv6_rcv etc). These packet >> > > handlers intern use the device passed to determine the net-ns to >> > > further process these packets. Now with the nomination logic, the >> > > dispatcher will call netif_get_l3_dev() helper to select the device >> > > to be used for this processing. Since l3_dev is initialized to self, >> > > normal packet processing should not change. >> > > >> > >> > So, if I understand your patches correctly, _logically_ the skb is still >> > passed into the slave's netns via dev_forward_skb() but now goes over >> > the iptable rules from the default netns by only changing the netns >> > parameter to these hooks? >> > >> We are using different dev pointer for L3 processing than skb->dev. All >> netns, routing etc, associated with this dev (l3_dev) should be used for >> L3. >> >> > That is ugly... Logically, you should still need to continue to pass >> > the skb upper to the stack in default netns until >> > ip_local_deliver_finish(). >> > >> > >> > So, how about adding an iptable hook in ipvlan so that skb will >> > continue traverse in the original stack and then moved into slave's >> > netns? This might be harder since logically we need an L3 entrance >> > to the stack. >> > >> > Thoughts? >> >> As you mentioned logically we should be able to pass the skb in master's >> ns >> until L3 processing is completed. This patch series attempts to do that by >> disassociating this logic from skb->dev and adding it to l3_dev. This >> should >> include not just IPT but all that is done in L3 phase (IPT, routing etc.) >> Also since dev->l3_dev is same as dev, this should not break any existing >> logic. >> > Well, looking at the code I realized that I missed few places (especially > routing > logic) which continues using skb->dev in ingress path and should be > corrected to > use l3_dev. I'll correct those places and send the next version. Look, even you yourself are missing something here. ;) This is exactly why I suggest to consider another approach. Please don't introduce any code that is hard to debug even for yourself. The struct net pointer is passed around in kernel network subsystem almost everywhere, it is not easy to make it bug-free by just switching skb->dev. >> That's the generic implementation as far as the stack is concerned and >> IPvlan >> uses it to make the IPT hooks symmetric. >> >> Another IPT hook may be good enough (however I haven't >> given much thought to it) for IPvlan, but this generic approach will be >> for >> whole of L3. Also currently this I have implemented for the ingress path >> but that does not mean the same cannot be extended for the egress path >> (in fact I'm thinking about that) > This is logically correct and easier to understand or debug, since IPT hook is very common in network subsystem even ingress qdisc uses it.
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Wed, Mar 2, 2016 at 8:45 PM, Cong Wang wrote: > > On Mon, Feb 29, 2016 at 2:08 PM, Mahesh Bandewar wrote: > > From: Mahesh Bandewar > > > > netif_receive_skb_core() dispatcher uses skb->dev device to send it > > to the packet-handlers (e.g. ip_rcv, ipv6_rcv etc). These packet > > handlers intern use the device passed to determine the net-ns to > > further process these packets. Now with the nomination logic, the > > dispatcher will call netif_get_l3_dev() helper to select the device > > to be used for this processing. Since l3_dev is initialized to self, > > normal packet processing should not change. > > > > So, if I understand your patches correctly, _logically_ the skb is still > passed into the slave's netns via dev_forward_skb() but now goes over > the iptable rules from the default netns by only changing the netns > parameter to these hooks? > We are using different dev pointer for L3 processing than skb->dev. All netns, routing etc, associated with this dev (l3_dev) should be used for L3. > That is ugly... Logically, you should still need to continue to pass > the skb upper to the stack in default netns until ip_local_deliver_finish(). > > > So, how about adding an iptable hook in ipvlan so that skb will > continue traverse in the original stack and then moved into slave's > netns? This might be harder since logically we need an L3 entrance > to the stack. > > Thoughts? As you mentioned logically we should be able to pass the skb in master's ns until L3 processing is completed. This patch series attempts to do that by disassociating this logic from skb->dev and adding it to l3_dev. This should include not just IPT but all that is done in L3 phase (IPT, routing etc.) Also since dev->l3_dev is same as dev, this should not break any existing logic. That's the generic implementation as far as the stack is concerned and IPvlan uses it to make the IPT hooks symmetric. Another IPT hook may be good enough (however I haven't given much thought to it) for IPvlan, but this generic approach will be for whole of L3. Also currently this I have implemented for the ingress path but that does not mean the same cannot be extended for the egress path (in fact I'm thinking about that)
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Mon, Feb 29, 2016 at 2:08 PM, Mahesh Bandewar wrote: > From: Mahesh Bandewar > > netif_receive_skb_core() dispatcher uses skb->dev device to send it > to the packet-handlers (e.g. ip_rcv, ipv6_rcv etc). These packet > handlers intern use the device passed to determine the net-ns to > further process these packets. Now with the nomination logic, the > dispatcher will call netif_get_l3_dev() helper to select the device > to be used for this processing. Since l3_dev is initialized to self, > normal packet processing should not change. > So, if I understand your patches correctly, _logically_ the skb is still passed into the slave's netns via dev_forward_skb() but now goes over the iptable rules from the default netns by only changing the netns parameter to these hooks? That is ugly... Logically, you should still need to continue to pass the skb upper to the stack in default netns until ip_local_deliver_finish(). So, how about adding an iptable hook in ipvlan so that skb will continue traverse in the original stack and then moved into slave's netns? This might be harder since logically we need an L3 entrance to the stack. Thoughts?
[PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
From: Mahesh Bandewar netif_receive_skb_core() dispatcher uses skb->dev device to send it to the packet-handlers (e.g. ip_rcv, ipv6_rcv etc). These packet handlers intern use the device passed to determine the net-ns to further process these packets. Now with the nomination logic, the dispatcher will call netif_get_l3_dev() helper to select the device to be used for this processing. Since l3_dev is initialized to self, normal packet processing should not change. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/core/dev.c | 9 ++--- net/ipv4/ip_input.c | 5 +++-- net/ipv6/ip6_input.c | 5 +++-- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index c4023a68cdc1..9252436ef11a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1811,7 +1811,8 @@ static inline int deliver_skb(struct sk_buff *skb, if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) return -ENOMEM; atomic_inc(&skb->users); - return pt_prev->func(skb, skb->dev, pt_prev, orig_dev); + return pt_prev->func(skb, netif_get_l3_dev(skb->dev), pt_prev, +orig_dev); } static inline void deliver_ptype_list_skb(struct sk_buff *skb, @@ -1904,7 +1905,8 @@ again: } out_unlock: if (pt_prev) - pt_prev->func(skb2, skb->dev, pt_prev, skb->dev); + pt_prev->func(skb2, netif_get_l3_dev(skb->dev), pt_prev, + skb->dev); rcu_read_unlock(); } @@ -4157,7 +4159,8 @@ ncls: if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) goto drop; else - ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); + ret = pt_prev->func(skb, netif_get_l3_dev(skb->dev), + pt_prev, orig_dev); } else { drop: if (!deliver_exact) diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index e3d782746d9d..b47164e3e1c6 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -247,7 +247,8 @@ int ip_local_deliver(struct sk_buff *skb) /* * Reassemble IP fragments. */ - struct net *net = dev_net(skb->dev); + struct net_device *dev = netif_get_l3_dev(skb->dev); + struct net *net = dev_net(dev); if (ip_is_fragment(ip_hdr(skb))) { if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER)) @@ -255,7 +256,7 @@ int ip_local_deliver(struct sk_buff *skb) } return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, - net, NULL, skb, skb->dev, NULL, + net, NULL, skb, dev, NULL, ip_local_deliver_finish); } diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index c05c425c2389..88443ac06402 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -287,9 +287,10 @@ discard: int ip6_input(struct sk_buff *skb) { + struct net_device *dev = netif_get_l3_dev(skb->dev); + return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN, - dev_net(skb->dev), NULL, skb, skb->dev, NULL, - ip6_input_finish); + dev_net(dev), NULL, skb, dev, NULL, ip6_input_finish); } int ip6_mc_input(struct sk_buff *skb) -- 2.7.0.rc3.207.g0ac5344