On Thu, Mar 10, 2016 at 1:47 AM, Nicolas Dichtel <nicolas.dich...@6wind.com> wrote: > Le 09/03/2016 22:49, Mahesh Bandewar a écrit : >> >> From: Mahesh Bandewar <mahe...@google.com> >> >> One of the major request (for enhancement) that I have received >> from various users of IPvlan in L3 mode is its inability to handle >> IPtables. >> >> While looking at the code and how we handle ingress, the problem >> can be attributed to the asymmetry in the way packets get processed >> for IPvlan devices configured in L3 mode. L3 mode is supposed to >> be restrictive and all the L3 decisions need to be taken for the >> traffic in master's ns. This does happen as expected for egress >> traffic however on ingress traffic, the IPvlan packet-handler >> changes the skb->dev and this forces packet to be processed with >> the IPvlan slave and it's associated ns. This causes above mentioned >> problem and few other which are not yet reported / attempted. e.g. >> IPsec with L3 mode or even ingress routing. >> >> This could have been solved if we had a way to handover packet to >> slave and associated ns after completing the L3 phase. This is a >> non-trivial issue to fix especially looking at IPsec code. >> >> This patch series attempts to solve this problem by introducing the >> device pointer l3_dev which resides in net_device structure in the >> RX cache line. We initialize the l3_dev to self. This would mean >> there is no complex logic to when-and-how-to initialize it. Now >> the stack will use this dev pointer during the L3 phase. This should >> not alter any existing properties / behavior and also there should >> not be any additional penalties since it resides in the same RX >> cache line. > > If I understand correctly (and as Cong already said), information are > leaking > between netns during the input phase. On the tx side, skb_scrub_packet() is > called, but not on the rx side. I think it's wrong. There should be an > explicit > boundary.
That is not what I am complaining about. I dislike the trick of switching skb->dev pointer with skb->dev->l3_dev. This is not how we switch netns, nor the way how netns works. Look at veth pair or dev_change_net_namespace(), each time when we switch netns, we need to do a full reregistration or a full reentrance, we never just switch some pointers to switch netns. This is why I said it breaks isolation. Also, it is ugly to hide such a ipvlan-specific pointer for half of the RX code path.