[PATCH ipsec 0/2] vti(6): fix ipv4 pmtu check to honor ip header df

2021-02-26 Thread Eyal Birger
This series aligns vti(6) handling of non-df IPv4 packets exceeding
the size of the tunnel MTU to avoid sending "Frag needed" and instead
fragment the packets after encapsulation.

Eyal Birger (2):
  vti: fix ipv4 pmtu check to honor ip header df
  vti6: fix ipv4 pmtu check to honor ip header df

 net/ipv4/ip_vti.c  | 6 --
 net/ipv6/ip6_vti.c | 6 --
 2 files changed, 8 insertions(+), 4 deletions(-)

-- 
2.25.1



[PATCH ipsec 1/2] vti: fix ipv4 pmtu check to honor ip header df

2021-02-26 Thread Eyal Birger
Frag needed should only be sent if the header enables DF.

This fix allows packets larger than MTU to pass the vti interface
and be fragmented after encapsulation, aligning behavior with
non-vti xfrm.

Fixes: d6af1a31cc72 ("vti: Add pmtu handling to vti_xmit.")
Signed-off-by: Eyal Birger 
---
 net/ipv4/ip_vti.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index abc171e79d3e..613741384490 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -218,7 +218,7 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct 
net_device *dev,
}
 
if (dst->flags & DST_XFRM_QUEUE)
-   goto queued;
+   goto xmit;
 
if (!vti_state_check(dst->xfrm, parms->iph.daddr, parms->iph.saddr)) {
dev->stats.tx_carrier_errors++;
@@ -238,6 +238,8 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct 
net_device *dev,
if (skb->len > mtu) {
skb_dst_update_pmtu_no_confirm(skb, mtu);
if (skb->protocol == htons(ETH_P_IP)) {
+   if (!(ip_hdr(skb)->frag_off & htons(IP_DF)))
+   goto xmit;
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(mtu));
} else {
@@ -251,7 +253,7 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct 
net_device *dev,
goto tx_error;
}
 
-queued:
+xmit:
skb_scrub_packet(skb, !net_eq(tunnel->net, dev_net(dev)));
skb_dst_set(skb, dst);
skb->dev = skb_dst(skb)->dev;
-- 
2.25.1



[PATCH ipsec 2/2] vti6: fix ipv4 pmtu check to honor ip header df

2021-02-26 Thread Eyal Birger
Frag needed should only be sent if the header enables DF.

This fix allows IPv4 packets larger than MTU to pass the vti6 interface
and be fragmented after encapsulation, aligning behavior with
non-vti6 xfrm.

Fixes: ccd740cbc6e0 ("vti6: Add pmtu handling to vti6_xmit.")
Signed-off-by: Eyal Birger 
---
 net/ipv6/ip6_vti.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 0225fd694192..2f0be5ac021c 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -494,7 +494,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
}
 
if (dst->flags & DST_XFRM_QUEUE)
-   goto queued;
+   goto xmit;
 
x = dst->xfrm;
if (!vti6_state_check(x, &t->parms.raddr, &t->parms.laddr))
@@ -523,6 +523,8 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
 
icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
} else {
+   if (!(ip_hdr(skb)->frag_off & htons(IP_DF)))
+   goto xmit;
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(mtu));
}
@@ -531,7 +533,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
goto tx_err_dst_release;
}
 
-queued:
+xmit:
skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev)));
skb_dst_set(skb, dst);
skb->dev = skb_dst(skb)->dev;
-- 
2.25.1



Re: High (200+) XFRM interface count performance problem (throughput)

2021-02-24 Thread Eyal Birger
Hi Vinš,

On Tue, Feb 23, 2021 at 9:52 PM Vinš Karel  wrote:
>
> Hello,
>
> I would like to ask you for help or advise.
>
> I'm testing setup with higher number of XFRM interfaces and I'm facing 
> throughput degradation with a growing number of created XFRM interfaces - not 
> concurrent tunnels established but only XFRM interfaces created - even in 
> DOWN state.
> Issue is only unidirectional - from "client" to "vpn hub". Throughput for 
> traffic from hub to client is not affected.
>
> XFRM interface created with:
> for i in {1..500}; do link add ipsec$i type xfrm dev ens224 if_id $i  ; done
>
> I'm testing with iperf3 with 1 client connected - from client to hub:
> 2 interfaces - 1.36 Gbps
> 100 interfaces - 1.35 Gbps
> 200 interfaces - 1.19 Gbps
> 300 interfaces - 0.98 Gbps
> 500 interfaces - 0.71 Gbps
>
> Throughput from hub to client is around 1.4 Gbps in all cases.
>
> 1 CPU core is 100%
>
> Linux v-hub 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux

Can you please try with a higher kernel version (>= 5.9)?
We've done some work to improve xfrm interface scaling
specifically e98e44562ba2
("xfrm interface: store xfrmi contexts in a hash by if_id").

Thanks,
Eyal.


Re: [PATCH ipsec,v2] xfrm: interface: fix ipv4 pmtu check to honor ip header df

2021-02-23 Thread Eyal Birger
Hi,

On Tue, Feb 23, 2021 at 5:18 PM Sabrina Dubroca  wrote:
>
> 2021-02-20, 15:01:15 +0200, Eyal Birger wrote:
> > Frag needed should only be sent if the header enables DF.
> >
> > This fix allows packets larger than MTU to pass the xfrm interface
> > and be fragmented after encapsulation, aligning behavior with
> > non-interface xfrm.
> >
> > Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
> > Signed-off-by: Eyal Birger 
> >
> > -
> >
> > v2: better align coding with ip_vti
>
> LGTM. We also need to do the same thing in ip_vti and ip6_vti. Do you
> want to take care of it, or should I?

I can submit the same fix for vti{,6}.

>
> Either way, for this patch:
> Reviewed-by: Sabrina Dubroca 

Thanks!
Eyal.


[PATCH ipsec,v2] xfrm: interface: fix ipv4 pmtu check to honor ip header df

2021-02-20 Thread Eyal Birger
Frag needed should only be sent if the header enables DF.

This fix allows packets larger than MTU to pass the xfrm interface
and be fragmented after encapsulation, aligning behavior with
non-interface xfrm.

Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Eyal Birger 

-

v2: better align coding with ip_vti
---
 net/xfrm/xfrm_interface.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 697cdcfbb5e1..3f42c2f15ba4 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -305,6 +305,8 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
 
icmpv6_ndo_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
} else {
+   if (!(ip_hdr(skb)->frag_off & htons(IP_DF)))
+   goto xmit;
icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(mtu));
}
@@ -313,6 +315,7 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
return -EMSGSIZE;
}
 
+xmit:
xfrmi_scrub_packet(skb, !net_eq(xi->net, dev_net(dev)));
skb_dst_set(skb, dst);
skb->dev = tdev;
-- 
2.25.1



[PATCH ipsec] xfrm: interface: fix ipv4 pmtu check to honor ip header df

2021-02-19 Thread Eyal Birger
Frag needed should only be sent if the header enables DF.

This fix allows packets larger than MTU to pass the xfrm interface
and be fragmented after encapsulation, aligning behavior with
non-interface xfrm.

Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Eyal Birger 
---
 net/xfrm/xfrm_interface.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 697cdcfbb5e1..257b3c8b3995 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -304,13 +304,16 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
mtu = IPV6_MIN_MTU;
 
icmpv6_ndo_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
-   } else {
+   err = -EMSGSIZE;
+   goto tx_err_dst_release;
+   }
+
+   if (ip_hdr(skb)->frag_off & htons(IP_DF)) {
icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(mtu));
+   err = -EMSGSIZE;
+   goto tx_err_dst_release;
}
-
-   dst_release(dst);
-   return -EMSGSIZE;
}
 
xfrmi_scrub_packet(skb, !net_eq(xi->net, dev_net(dev)));
-- 
2.25.1



Re: [PATCH net-next] net/packet: Improve the comment about LL header visibility criteria

2021-02-06 Thread Eyal Birger
Hi,

On Sat, Feb 6, 2021 at 4:52 AM Willem de Bruijn
 wrote:
>
> On Fri, Feb 5, 2021 at 5:42 PM Xie He  wrote:
> >
> > The "dev_has_header" function, recently added in
> > commit d549699048b4 ("net/packet: fix packet receive on L3 devices
> > without visible hard header"),
> > is more accurate as criteria for determining whether a device exposes
> > the LL header to upper layers, because in addition to dev->header_ops,
> > it also checks for dev->header_ops->create.
> >
> > When transmitting an skb on a device, dev_hard_header can be called to
> > generate an LL header. dev_hard_header will only generate a header if
> > dev->header_ops->create is present.
> >
> > Signed-off-by: Xie He 
>
> Acked-by: Willem de Bruijn 
>
> Indeed, existence of dev->header_ops->create is the deciding factor. Thanks 
> Xie.

As such, may I suggest making this explicit by making
dev_hard_header() use dev_has_header()?

Eyal.


[PATCH ipsec-next] xfrm: interface: enable TSO on xfrm interfaces

2021-01-05 Thread Eyal Birger
Underlying xfrm output supports gso packets.
Declare support in hw_features and adapt the xmit MTU check to pass GSO
packets.

Signed-off-by: Eyal Birger 
---
 net/xfrm/xfrm_interface.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 697cdcfbb5e1..495b1f5c979b 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -296,7 +296,8 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
}
 
mtu = dst_mtu(dst);
-   if (skb->len > mtu) {
+   if ((!skb_is_gso(skb) && skb->len > mtu) ||
+   (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))) {
skb_dst_update_pmtu_no_confirm(skb, mtu);
 
if (skb->protocol == htons(ETH_P_IPV6)) {
@@ -564,6 +565,11 @@ static void xfrmi_dev_setup(struct net_device *dev)
eth_broadcast_addr(dev->broadcast);
 }
 
+#define XFRMI_FEATURES (NETIF_F_SG |   \
+   NETIF_F_FRAGLIST |  \
+   NETIF_F_GSO_SOFTWARE |  \
+   NETIF_F_HW_CSUM)
+
 static int xfrmi_dev_init(struct net_device *dev)
 {
struct xfrm_if *xi = netdev_priv(dev);
@@ -581,6 +587,8 @@ static int xfrmi_dev_init(struct net_device *dev)
}
 
dev->features |= NETIF_F_LLTX;
+   dev->features |= XFRMI_FEATURES;
+   dev->hw_features |= XFRMI_FEATURES;
 
if (phydev) {
dev->needed_headroom = phydev->needed_headroom;
-- 
2.25.1



Re: [RFC ipsec-next] xfrm: interface: enable TSO on xfrm interfaces

2021-01-04 Thread Eyal Birger
Hi Steffen,

On Mon, Jan 4, 2021 at 10:44 AM Steffen Klassert
 wrote:
>
> On Wed, Dec 23, 2020 at 09:15:38AM +0200, Eyal Birger wrote:
> > Underlying xfrm output supports gso packets.
> > Declare support in hw_features and adapt the xmit MTU check to pass GSO
> > packets.
> >
> > Signed-off-by: Eyal Birger 
>
> Looks ok to me.

Great, Thanks for the review.

 Should I submit a non-rfc patch once the merge window opens?

Eyal.


Re: [PATCH ipsec-next] xfrm: interface: support collect metadata mode

2020-12-27 Thread Eyal Birger
Hi Steffen,

On Mon, Dec 7, 2020 at 11:55 AM Steffen Klassert
 wrote:
>
> On Fri, Nov 27, 2020 at 02:32:44PM +0200, Eyal Birger wrote:
> > Hi Steffen,
> >
> > On Fri, Nov 27, 2020 at 11:44 AM Steffen Klassert
> >  wrote:
> > >
> > > On Sat, Nov 21, 2020 at 04:28:23PM +0200, Eyal Birger wrote:
> > > > This commit adds support for 'collect_md' mode on xfrm interfaces.
> > > >
> > > > Each net can have one collect_md device, created by providing the
> > > > IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
> > > > altered and has no if_id or link device attributes.
> > > >
> > > > On transmit to this device, the if_id is fetched from the attached dst
> > > > metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
> > > > since the only needed property is the if_id stored in the tun_id member
> > > > of the ip_tunnel_info->key.
> > >
> > > Can we please have a separate metadata type for xfrm interfaces?
> > >
> > > Sharing such structures turned already out to be a bad idea
> > > on vti interfaces, let's try to avoid that misstake with
> > > xfrm interfaces.
> >
> > My initial thought was to do that, but it looks like most of the constructs
> > surrounding this facility - tc, nft, ovs, ebpf, ip routing - are built 
> > around
> > struct ip_tunnel_info and don't regard other possible metadata types.
>
> That is likely because most objects that have a collect_md mode are
> tunnels. We have already a second metadata type, and I don't see
> why we can't have a third one. Maybe we can create something more
> generic so that it can have other users too.
>
> > For xfrm interfaces, the only metadata used is the if_id, which is stored
> > in the metadata tun_id, so I think other than naming consideration, the use
> > of struct ip_tunnel_info does not imply tunneling and does not limit the
> > use of xfrmi to a specific mode of operation.
>
> I agree that this can work, but it is a first step into a wrong direction.
> Using a __be64 field of a completely unrelated structure as an u32 if_id
> is bad style IMO.
>
> > On the other hand, adding a new metadata type would require changing all
> > other places to regard the new metadata type, with a large number of
> > userspace visible changes.
>
> I admit that this might have some disadvantages too, but I'm not convinced
> that this justifies the 'ip_tunnel_info' hack.
>

I understand. I'll try to come up with something more generic.
I hope I can find a way to still utilize the existing userspace
constructs.

Thanks!
Eyal.


[PATCH ipsec] xfrm: fix disable_xfrm sysctl when used on xfrm interfaces

2020-12-23 Thread Eyal Birger
The disable_xfrm flag signals that xfrm should not be performed during
routing towards a device before reaching device xmit.

For xfrm interfaces this is usually desired as they perform the outbound
policy lookup as part of their xmit using their if_id.

Before this change enabling this flag on xfrm interfaces prevented them
from xmitting as xfrm_lookup_with_ifid() would not perform a policy lookup
in case the original dst had the DST_NOXFRM flag.

This optimization is incorrect when the lookup is done by the xfrm
interface xmit logic.

Fix by performing policy lookup when invoked by xfrmi as if_id != 0.

Similarly it's unlikely for the 'no policy exists on net' check to yield
any performance benefits when invoked from xfrmi.

Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Eyal Birger 
---
 net/xfrm/xfrm_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d622c2548d22..2f84136af48a 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -3078,8 +3078,8 @@ struct dst_entry *xfrm_lookup_with_ifid(struct net *net,
xflo.flags = flags;
 
/* To accelerate a bit...  */
-   if ((dst_orig->flags & DST_NOXFRM) ||
-   !net->xfrm.policy_count[XFRM_POLICY_OUT])
+   if (!if_id && ((dst_orig->flags & DST_NOXFRM) ||
+  !net->xfrm.policy_count[XFRM_POLICY_OUT]))
goto nopol;
 
xdst = xfrm_bundle_lookup(net, fl, family, dir, &xflo, if_id);
-- 
2.25.1



[RFC ipsec-next] xfrm: interface: enable TSO on xfrm interfaces

2020-12-22 Thread Eyal Birger
Underlying xfrm output supports gso packets.
Declare support in hw_features and adapt the xmit MTU check to pass GSO
packets.

Signed-off-by: Eyal Birger 
---
 net/xfrm/xfrm_interface.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 9b8e292a7c6a..d28e9f05d9dd 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -296,7 +296,8 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
}
 
mtu = dst_mtu(dst);
-   if (skb->len > mtu) {
+   if ((!skb_is_gso(skb) && skb->len > mtu) ||
+   (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))) {
skb_dst_update_pmtu_no_confirm(skb, mtu);
 
if (skb->protocol == htons(ETH_P_IPV6)) {
@@ -579,6 +580,11 @@ static void xfrmi_dev_setup(struct net_device *dev)
eth_broadcast_addr(dev->broadcast);
 }
 
+#define XFRMI_FEATURES (NETIF_F_SG |   \
+   NETIF_F_FRAGLIST |  \
+   NETIF_F_GSO_SOFTWARE |  \
+   NETIF_F_HW_CSUM)
+
 static int xfrmi_dev_init(struct net_device *dev)
 {
struct xfrm_if *xi = netdev_priv(dev);
@@ -596,6 +602,8 @@ static int xfrmi_dev_init(struct net_device *dev)
}
 
dev->features |= NETIF_F_LLTX;
+   dev->features |= XFRMI_FEATURES;
+   dev->hw_features |= XFRMI_FEATURES;
 
if (phydev) {
dev->needed_headroom = phydev->needed_headroom;
-- 
2.25.1



Re: [PATCH v2] xfrm: interface: Don't hide plain packets from netfilter

2020-12-10 Thread Eyal Birger
Hi Nicolas,

On Thu, Dec 10, 2020 at 1:10 PM Nicolas Dichtel
 wrote:
>
> Le 09/12/2020 à 15:40, Eyal Birger a écrit :
> > Hi Phil,
> >
> > On Tue, Dec 8, 2020 at 8:51 PM Phil Sutter  wrote:
> >>
> >> Hi Eyal,
> >>
> >> On Tue, Dec 08, 2020 at 04:47:02PM +0200, Eyal Birger wrote:
> >>> On Mon, Dec 7, 2020 at 4:07 PM Phil Sutter  wrote:
> [snip]
> >>
> >> The packet appears twice being sent to eth1, the second time as ESP
> >> packet. I understand xfrm interface as a collector of to-be-xfrmed
> >> packets, dropping those which do not match a policy.
> >>
> >>>> Fix this by looping packets transmitted from xfrm_interface through
> >>>> NF_INET_LOCAL_OUT before passing them on to dst_output(), which makes
> >>>> behaviour consistent again from netfilter's point of view.
> >>>
> >>> When an XFRM interface is used when forwarding, why would it be correct
> >>> for NF_INET_LOCAL_OUT to observe the inner packet?
> I think it is valid because:
>  - it would be consistent with ip tunnels (see iptunnel_xmit())

Are you referring to the flow:
  iptunnel_xmit()
ip_local_out()
  __ip_local_out()
nf_hook(.., NF_INET_LOCAL_OUT, ...)

If I understand that flow correctly it operates on the outer packet
as it is called after all the header had been pushed already. no?
Or are you referring to a different flow?

>  - it would be consistent with the standard xfrm path see [1]

In the regular path as well I understand the OUTPUT hooks are called
after xfrm encoding in the forwarding case, so they can't see the inner
packet.

>  - from the POV of the forwarder, the packet is locally emitted, the src @ is
>owned by the forwarder.

The inner IP source address is not owned by the forwarder to my understanding.

> >>
> >> A valid question, indeed. One could interpret packets being forwarded by
> >> those tunneling devices emit the packets one feeds them from the local
> >> host. I just checked and ip_vti behaves identical to xfrm_interface
> >> prior to my patch, so maybe my patch is crap and the inability to match
> >> on ipsec context data when using any of those devices is just by design.
> There was no real design for vti[6] interfaces, it's why xfrmi interfaces have
> been added. But they should be consistent I think, so this patch should handle
> xfrmi and vti[6] together.

I also think they should be consistent. But it'd still be confusing to me
to get an OUTPUT hook on the inner packet in the forwarding case.

Thanks,
Eyal.


Re: [PATCH v2] xfrm: interface: Don't hide plain packets from netfilter

2020-12-09 Thread Eyal Birger
Hi Phil,

On Tue, Dec 8, 2020 at 8:51 PM Phil Sutter  wrote:
>
> Hi Eyal,
>
> On Tue, Dec 08, 2020 at 04:47:02PM +0200, Eyal Birger wrote:
> > On Mon, Dec 7, 2020 at 4:07 PM Phil Sutter  wrote:
> > >
> > > With an IPsec tunnel without dedicated interface, netfilter sees locally
> > > generated packets twice as they exit the physical interface: Once as "the
> > > inner packet" with IPsec context attached and once as the encrypted
> > > (ESP) packet.
> > >
> > > With xfrm_interface, the inner packet did not traverse NF_INET_LOCAL_OUT
> > > hook anymore, making it impossible to match on both inner header values
> > > and associated IPsec data from that hook.
> > >
> >
> > Why wouldn't locally generated traffic not traverse the
> > NF_INET_LOCAL_OUT hook via e.g. __ip_local_out() when xmitted on an xfrmi?
> > I would expect it to appear in netfilter, but without the IPsec
> > context, as it's not
> > there yet.
>
> Yes, that's right. Having an iptables rule with LOG target in OUTPUT
> chain, a packet sent from the local host is logged multiple times:
>
> | IN= OUT=xfrm SRC=192.168.111.1 DST=192.168.111.2 LEN=84 TOS=0x00 PREC=0x00 
> TTL=64 ID=21840 DF
> | PROTO=ICMP TYPE=8 CODE=0 ID=56857 SEQ=1
> | IN= OUT=eth0 SRC=192.168.111.1 DST=192.168.111.2 LEN=84 TOS=0x00 PREC=0x00 
> TTL=64 ID=21840 DF PROTO=ICMP TYPE=8 CODE=0 ID=56857 SEQ=1
> | IN= OUT=eth0 SRC=192.168.1.1 DST=192.168.1.2 LEN=140 TOS=0x00 PREC=0x00 
> TTL=64 ID=0 DF PROTO=ESP SPI=0x1000
>
> First when being sent to xfrm interface, then two times between xfrm and
> eth0, the second time as ESP packet. This is with my patch applied.
> Without it, the second log entry is missing. I'm arguing the above is
> consistent to IPsec without xfrm interface:
>
> | IN= OUT=eth1 SRC=192.168.112.1 DST=192.168.112.2 LEN=84 TOS=0x00 PREC=0x00 
> TTL=64 ID=49341 DF PROTO=ICMP TYPE=8 CODE=0 ID=44114 SEQ=1
> | IN= OUT=eth1 SRC=192.168.2.1 DST=192.168.2.2 LEN=140 TOS=0x00 PREC=0x00 
> TTL=64 ID=37109 DF PROTO=ESP SPI=0x1000
>
> The packet appears twice being sent to eth1, the second time as ESP
> packet. I understand xfrm interface as a collector of to-be-xfrmed
> packets, dropping those which do not match a policy.
>
> > > Fix this by looping packets transmitted from xfrm_interface through
> > > NF_INET_LOCAL_OUT before passing them on to dst_output(), which makes
> > > behaviour consistent again from netfilter's point of view.
> >
> > When an XFRM interface is used when forwarding, why would it be correct
> > for NF_INET_LOCAL_OUT to observe the inner packet?
>
> A valid question, indeed. One could interpret packets being forwarded by
> those tunneling devices emit the packets one feeds them from the local
> host. I just checked and ip_vti behaves identical to xfrm_interface
> prior to my patch, so maybe my patch is crap and the inability to match
> on ipsec context data when using any of those devices is just by design.
>

I would find such interpretation and behavior to be surprising for an IPsec
forwarder...
I guess some functionality of policy matching is lost with these
devices; although they do offer the ability to match ipsec traffic based on
the destination interface it is possible to have multiple ipsec flows share
the same device so netfilter doesn't provide the ability to distinguish
between different flows on the outbound direction in such cases.

Thanks,
Eyal.


Re: [PATCH v2] xfrm: interface: Don't hide plain packets from netfilter

2020-12-08 Thread Eyal Birger
Hi Phil,

On Mon, Dec 7, 2020 at 4:07 PM Phil Sutter  wrote:
>
> With an IPsec tunnel without dedicated interface, netfilter sees locally
> generated packets twice as they exit the physical interface: Once as "the
> inner packet" with IPsec context attached and once as the encrypted
> (ESP) packet.
>
> With xfrm_interface, the inner packet did not traverse NF_INET_LOCAL_OUT
> hook anymore, making it impossible to match on both inner header values
> and associated IPsec data from that hook.
>

Why wouldn't locally generated traffic not traverse the
NF_INET_LOCAL_OUT hook via e.g. __ip_local_out() when xmitted on an xfrmi?
I would expect it to appear in netfilter, but without the IPsec
context, as it's not
there yet.

> Fix this by looping packets transmitted from xfrm_interface through
> NF_INET_LOCAL_OUT before passing them on to dst_output(), which makes
> behaviour consistent again from netfilter's point of view.

When an XFRM interface is used when forwarding, why would it be correct
for NF_INET_LOCAL_OUT to observe the inner packet?

What am I missing?

Thanks!
Eyal.


Re: [PATCH ipsec-next] xfrm: interface: support collect metadata mode

2020-11-27 Thread Eyal Birger
Hi Steffen,

On Fri, Nov 27, 2020 at 11:44 AM Steffen Klassert
 wrote:
>
> On Sat, Nov 21, 2020 at 04:28:23PM +0200, Eyal Birger wrote:
> > This commit adds support for 'collect_md' mode on xfrm interfaces.
> >
> > Each net can have one collect_md device, created by providing the
> > IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
> > altered and has no if_id or link device attributes.
> >
> > On transmit to this device, the if_id is fetched from the attached dst
> > metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
> > since the only needed property is the if_id stored in the tun_id member
> > of the ip_tunnel_info->key.
>
> Can we please have a separate metadata type for xfrm interfaces?
>
> Sharing such structures turned already out to be a bad idea
> on vti interfaces, let's try to avoid that misstake with
> xfrm interfaces.

My initial thought was to do that, but it looks like most of the constructs
surrounding this facility - tc, nft, ovs, ebpf, ip routing - are built around
struct ip_tunnel_info and don't regard other possible metadata types.

For xfrm interfaces, the only metadata used is the if_id, which is stored
in the metadata tun_id, so I think other than naming consideration, the use
of struct ip_tunnel_info does not imply tunneling and does not limit the
use of xfrmi to a specific mode of operation.

On the other hand, adding a new metadata type would require changing all
other places to regard the new metadata type, with a large number of
userspace visible changes.

>
> > On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
> > packet received and attaches it to the skb. The if_id used in this case is
> > fetched from the xfrm state. This can later be used by upper layers such
> > as tc, ebpf, and ip rules.
> >
> > Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst
> > metadata is postponed until after scrubing. Similarly, xfrm_input() is
> > adapted to avoid dropping metadata dsts by only dropping 'valid'
> > (skb_valid_dst(skb) == true) dsts.
> >
> > Policy matching on packets arriving from collect_md xfrmi devices is
> > done by using the xfrm state existing in the skb's sec_path.
> > The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
> > is changed to keep the details of the if_id extraction tucked away
> > in xfrm_interface.c.
> >
> > Signed-off-by: Eyal Birger 
>
> The rest of the patch looks good.

Thanks for the review!
Eyal.


[PATCH ipsec-next] xfrm: interface: support collect metadata mode

2020-11-21 Thread Eyal Birger
This commit adds support for 'collect_md' mode on xfrm interfaces.

Each net can have one collect_md device, created by providing the
IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
altered and has no if_id or link device attributes.

On transmit to this device, the if_id is fetched from the attached dst
metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
since the only needed property is the if_id stored in the tun_id member
of the ip_tunnel_info->key.

On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
packet received and attaches it to the skb. The if_id used in this case is
fetched from the xfrm state. This can later be used by upper layers such
as tc, ebpf, and ip rules.

Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst
metadata is postponed until after scrubing. Similarly, xfrm_input() is
adapted to avoid dropping metadata dsts by only dropping 'valid'
(skb_valid_dst(skb) == true) dsts.

Policy matching on packets arriving from collect_md xfrmi devices is
done by using the xfrm state existing in the skb's sec_path.
The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
is changed to keep the details of the if_id extraction tucked away
in xfrm_interface.c.

Signed-off-by: Eyal Birger 
---
 include/net/xfrm.h   |  11 +++-
 include/uapi/linux/if_link.h |   1 +
 net/xfrm/xfrm_input.c|   7 ++-
 net/xfrm/xfrm_interface.c| 114 ++-
 net/xfrm/xfrm_policy.c   |  10 +--
 5 files changed, 119 insertions(+), 24 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index b2a06f10b62c..925f8dcdd0db 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -308,9 +308,15 @@ struct xfrm_replay {
int (*overflow)(struct xfrm_state *x, struct sk_buff *skb);
 };
 
+struct xfrm_if_decode_session_params {
+   struct net *net;
+   u32 if_id;
+};
+
 struct xfrm_if_cb {
-   struct xfrm_if  *(*decode_session)(struct sk_buff *skb,
-  unsigned short family);
+   bool (*decode_session)(struct sk_buff *skb,
+  unsigned short family,
+  struct xfrm_if_decode_session_params *params);
 };
 
 void xfrm_if_register_cb(const struct xfrm_if_cb *ifcb);
@@ -984,6 +990,7 @@ void xfrm_dst_ifdown(struct dst_entry *dst, struct 
net_device *dev);
 struct xfrm_if_parms {
int link;   /* ifindex of underlying L2 interface */
u32 if_id;  /* interface identifyer */
+   bool collect_md;
 };
 
 struct xfrm_if {
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index c4b23f06f69e..ff04a06c2b69 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -655,6 +655,7 @@ enum {
IFLA_XFRM_UNSPEC,
IFLA_XFRM_LINK,
IFLA_XFRM_IF_ID,
+   IFLA_XFRM_COLLECT_METADATA,
__IFLA_XFRM_MAX
 };
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index be6351e3f3cd..c7de46d30697 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "xfrm_inout.h"
 
@@ -719,7 +720,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
sp = skb_sec_path(skb);
if (sp)
sp->olen = 0;
-   skb_dst_drop(skb);
+   if (skb_valid_dst(skb))
+   skb_dst_drop(skb);
gro_cells_receive(&gro_cells, skb);
return 0;
} else {
@@ -737,7 +739,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
sp = skb_sec_path(skb);
if (sp)
sp->olen = 0;
-   skb_dst_drop(skb);
+   if (skb_valid_dst(skb))
+   skb_dst_drop(skb);
gro_cells_receive(&gro_cells, skb);
return err;
}
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 9b8e292a7c6a..10c14967d305 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -56,11 +57,22 @@ static const struct net_device_ops xfrmi_netdev_ops;
 struct xfrmi_net {
/* lists for storing interfaces in use */
struct xfrm_if __rcu *xfrmi[XFRMI_HASH_SIZE];
+   struct xfrm_if __rcu *collect_md_xfrmi;
 };
 
 #define for_each_xfrmi_rcu(start, xi) \
for (xi = rcu_dereference(start); xi; xi = rcu_dereference(xi->next))
 
+static u32 tunnel_id_to_if_id(__be64 tun_id)
+{
+   return ntohl(tunnel_id_to_key32(tun_id));
+}
+
+static __be64 if_id_to_tunnel_id(u32 if_id)
+{
+   return key32_to_t

[PATCH net,v2] net/packet: fix packet receive on L3 devices without visible hard header

2020-11-20 Thread Eyal Birger
In the patchset merged by commit b9fcf0a0d826
("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
did not have header_ops were given one for the purpose of protocol parsing
on af_packet transmit path.

That change made af_packet receive path regard these devices as having a
visible L3 header and therefore aligned incoming skb->data to point to the
skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
reset their mac_header prior to ingress and therefore their incoming
packets became malformed.

Ideally these devices would reset their mac headers, or af_packet would be
able to rely on dev->hard_header_len being 0 for such cases, but it seems
this is not the case.

Fix by changing af_packet RX ll visibility criteria to include the
existence of a '.create()' header operation, which is used when creating
a device hard header - via dev_hard_header() - by upper layers, and does
not exist in these L3 devices.

As this predicate may be useful in other situations, add it as a common
dev_has_header() helper in netdevice.h.

Fixes: b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
Signed-off-by: Eyal Birger 

---

v2:
  - add common dev_has_header() helper as suggested by Willem de Bruijn
---
 include/linux/netdevice.h |  5 +
 net/packet/af_packet.c| 18 +-
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 964b494b0e8d..fa275a054f46 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3137,6 +3137,11 @@ static inline bool dev_validate_header(const struct 
net_device *dev,
return false;
 }
 
+static inline bool dev_has_header(const struct net_device *dev)
+{
+   return dev->header_ops && dev->header_ops->create;
+}
+
 typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr,
   int len, int size);
 int register_gifconf(unsigned int family, gifconf_func_t *gifconf);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index cefbd50c1090..7a188551 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -93,8 +93,8 @@
 
 /*
Assumptions:
-   - If the device has no dev->header_ops, there is no LL header visible
- above the device. In this case, its hard_header_len should be 0.
+   - If the device has no dev->header_ops->create, there is no LL header
+ visible above the device. In this case, its hard_header_len should be 0.
  The device may prepend its own header internally. In this case, its
  needed_headroom should be set to the space needed for it to add its
  internal header.
@@ -108,26 +108,26 @@
 On receive:
 ---
 
-Incoming, dev->header_ops != NULL
+Incoming, dev_has_header(dev) == true
mac_header -> ll header
data   -> data
 
-Outgoing, dev->header_ops != NULL
+Outgoing, dev_has_header(dev) == true
mac_header -> ll header
data   -> ll header
 
-Incoming, dev->header_ops == NULL
+Incoming, dev_has_header(dev) == false
mac_header -> data
  However drivers often make it point to the ll header.
  This is incorrect because the ll header should be invisible to us.
data   -> data
 
-Outgoing, dev->header_ops == NULL
+Outgoing, dev_has_header(dev) == false
mac_header -> data. ll header is invisible to us.
data   -> data
 
 Resume
-  If dev->header_ops == NULL we are unable to restore the ll header,
+  If dev_has_header(dev) == false we are unable to restore the ll header,
 because it is invisible to us.
 
 
@@ -2069,7 +2069,7 @@ static int packet_rcv(struct sk_buff *skb, struct 
net_device *dev,
 
skb->dev = dev;
 
-   if (dev->header_ops) {
+   if (dev_has_header(dev)) {
/* The device has an explicit notion of ll header,
 * exported to higher levels.
 *
@@ -2198,7 +2198,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
net_device *dev,
if (!net_eq(dev_net(dev), sock_net(sk)))
goto drop;
 
-   if (dev->header_ops) {
+   if (dev_has_header(dev)) {
if (sk->sk_type != SOCK_DGRAM)
skb_push(skb, skb->data - skb_mac_header(skb));
else if (skb->pkt_type == PACKET_OUTGOING) {
-- 
2.25.1



[net,v2] net/packet: fix packet receive on L3 devices without visible hard header

2020-11-20 Thread Eyal Birger
In the patchset merged by commit b9fcf0a0d826
("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
did not have header_ops were given one for the purpose of protocol parsing
on af_packet transmit path.

That change made af_packet receive path regard these devices as having a
visible L3 header and therefore aligned incoming skb->data to point to the
skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
reset their mac_header prior to ingress and therefore their incoming
packets became malformed.

Ideally these devices would reset their mac headers, or af_packet would be
able to rely on dev->hard_header_len being 0 for such cases, but it seems
this is not the case.

Fix by changing af_packet RX ll visibility criteria to include the
existence of a '.create()' header operation, which is used when creating
a device hard header - via dev_hard_header() - by upper layers, and does
not exist in these L3 devices.

As this predicate may be useful in other situations, add it as a common
dev_has_header() helper in netdevice.h.

Fixes: b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
Signed-off-by: Eyal Birger 

---

v2:
  - add common dev_has_header() helper as suggested by Willem de Bruijn
---
 include/linux/netdevice.h |  5 +
 net/packet/af_packet.c| 18 +-
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 964b494b0e8d..fa275a054f46 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3137,6 +3137,11 @@ static inline bool dev_validate_header(const struct 
net_device *dev,
return false;
 }
 
+static inline bool dev_has_header(const struct net_device *dev)
+{
+   return dev->header_ops && dev->header_ops->create;
+}
+
 typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr,
   int len, int size);
 int register_gifconf(unsigned int family, gifconf_func_t *gifconf);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index cefbd50c1090..7a188551 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -93,8 +93,8 @@
 
 /*
Assumptions:
-   - If the device has no dev->header_ops, there is no LL header visible
- above the device. In this case, its hard_header_len should be 0.
+   - If the device has no dev->header_ops->create, there is no LL header
+ visible above the device. In this case, its hard_header_len should be 0.
  The device may prepend its own header internally. In this case, its
  needed_headroom should be set to the space needed for it to add its
  internal header.
@@ -108,26 +108,26 @@
 On receive:
 ---
 
-Incoming, dev->header_ops != NULL
+Incoming, dev_has_header(dev) == true
mac_header -> ll header
data   -> data
 
-Outgoing, dev->header_ops != NULL
+Outgoing, dev_has_header(dev) == true
mac_header -> ll header
data   -> ll header
 
-Incoming, dev->header_ops == NULL
+Incoming, dev_has_header(dev) == false
mac_header -> data
  However drivers often make it point to the ll header.
  This is incorrect because the ll header should be invisible to us.
data   -> data
 
-Outgoing, dev->header_ops == NULL
+Outgoing, dev_has_header(dev) == false
mac_header -> data. ll header is invisible to us.
data   -> data
 
 Resume
-  If dev->header_ops == NULL we are unable to restore the ll header,
+  If dev_has_header(dev) == false we are unable to restore the ll header,
 because it is invisible to us.
 
 
@@ -2069,7 +2069,7 @@ static int packet_rcv(struct sk_buff *skb, struct 
net_device *dev,
 
skb->dev = dev;
 
-   if (dev->header_ops) {
+   if (dev_has_header(dev)) {
/* The device has an explicit notion of ll header,
 * exported to higher levels.
 *
@@ -2198,7 +2198,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
net_device *dev,
if (!net_eq(dev_net(dev), sock_net(sk)))
goto drop;
 
-   if (dev->header_ops) {
+   if (dev_has_header(dev)) {
if (sk->sk_type != SOCK_DGRAM)
skb_push(skb, skb->data - skb_mac_header(skb));
else if (skb->pkt_type == PACKET_OUTGOING) {
-- 
2.25.1



[net] net/packet: fix incoming receive for L3 devices without visible hard header

2020-11-19 Thread Eyal Birger
In the patchset merged by commit b9fcf0a0d826
("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
did not have header_ops were given one for the purpose of protocol parsing
on af_packet transmit path.

That change made af_packet receive path regard these devices as having a
visible L3 header and therefore aligned incoming skb->data to point to the
skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
reset their mac_header prior to ingress and therefore their incoming
packets became malformed.

Ideally these devices would reset their mac headers, or af_packet would be
able to rely on dev->hard_header_len being 0 for such cases, but it seems
this is not the case.

Fix by changing af_packet RX ll visibility criteria to include the
existence of a '.create()' header operation, which is used when creating
a device hard header - via dev_hard_header() - by upper layers, and does
not exist in these L3 devices.

Fixes: b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
Signed-off-by: Eyal Birger 
---
 net/packet/af_packet.c | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index cefbd50c1090..a241059fd536 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -93,8 +93,8 @@
 
 /*
Assumptions:
-   - If the device has no dev->header_ops, there is no LL header visible
- above the device. In this case, its hard_header_len should be 0.
+   - If the device has no dev->header_ops->create, there is no LL header
+ visible above the device. In this case, its hard_header_len should be 0.
  The device may prepend its own header internally. In this case, its
  needed_headroom should be set to the space needed for it to add its
  internal header.
@@ -108,21 +108,21 @@
 On receive:
 ---
 
-Incoming, dev->header_ops != NULL
+Incoming, dev->header_ops != NULL && dev->header_ops->create != NULL
mac_header -> ll header
data   -> data
 
-Outgoing, dev->header_ops != NULL
+Outgoing, dev->header_ops != NULL && dev->header_ops->create != NULL
mac_header -> ll header
data   -> ll header
 
-Incoming, dev->header_ops == NULL
+Incoming, dev->header_ops == NULL || dev->header_ops->create == NULL
mac_header -> data
  However drivers often make it point to the ll header.
  This is incorrect because the ll header should be invisible to us.
data   -> data
 
-Outgoing, dev->header_ops == NULL
+Outgoing, dev->header_ops == NULL || dev->header_ops->create == NULL
mac_header -> data. ll header is invisible to us.
data   -> data
 
@@ -272,6 +272,18 @@ static bool packet_use_direct_xmit(const struct 
packet_sock *po)
return po->xmit == packet_direct_xmit;
 }
 
+static bool packet_ll_header_rcv_visible(const struct net_device *dev)
+{
+   /* The device has an explicit notion of ll header,
+* exported to higher levels
+*
+* Otherwise, the device hides details of its frame
+* structure, so that corresponding packet head is
+* never delivered to user.
+*/
+   return dev->header_ops && dev->header_ops->create;
+}
+
 static u16 packet_pick_tx_queue(struct sk_buff *skb)
 {
struct net_device *dev = skb->dev;
@@ -2069,7 +2081,7 @@ static int packet_rcv(struct sk_buff *skb, struct 
net_device *dev,
 
skb->dev = dev;
 
-   if (dev->header_ops) {
+   if (packet_ll_header_rcv_visible(dev)) {
/* The device has an explicit notion of ll header,
 * exported to higher levels.
 *
@@ -2198,7 +2210,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
net_device *dev,
if (!net_eq(dev_net(dev), sock_net(sk)))
goto drop;
 
-   if (dev->header_ops) {
+   if (packet_ll_header_rcv_visible(dev)) {
if (sk->sk_type != SOCK_DGRAM)
skb_push(skb, skb->data - skb_mac_header(skb));
else if (skb->pkt_type == PACKET_OUTGOING) {
-- 
2.25.1



[iproute2] ipntable: add missing ndts_table_fulls ntable stat

2020-10-02 Thread Eyal Birger
Used for tracking neighbour table overflows.

Signed-off-by: Eyal Birger 
---
 ip/ipntable.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/ip/ipntable.c b/ip/ipntable.c
index ddee4905..b5b06a3b 100644
--- a/ip/ipntable.c
+++ b/ip/ipntable.c
@@ -517,6 +517,11 @@ static void print_ndtstats(const struct ndt_stats *ndts)
print_u64(PRINT_ANY, "forced_gc_runs", "forced_gc_runs %llu ",
   ndts->ndts_forced_gc_runs);
 
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+
+   print_u64(PRINT_ANY, "table_fulls", "table_fulls %llu ",
+ ndts->ndts_table_fulls);
+
print_nl();
 }
 
-- 
2.25.1



Re: BPF redirect API design issue for BPF-prog MTU feedback?

2020-09-21 Thread Eyal Birger
On Mon, Sep 21, 2020 at 7:30 PM Jesper Dangaard Brouer
 wrote:
>
> On Mon, 21 Sep 2020 17:08:17 +0200
> Daniel Borkmann  wrote:
>
> > On 9/21/20 2:49 PM, Jesper Dangaard Brouer wrote:
> > > On Mon, 21 Sep 2020 11:37:18 +0100
> > > Lorenz Bauer  wrote:
> > >> On Sat, 19 Sep 2020 at 00:06, Maciej Żenczykowski  
> > >> wrote:
> > >>>
> >  This is a good point.  As bpf_skb_adjust_room() can just be run after
> >  bpf_redirect() call, then a MTU check in bpf_redirect() actually
> >  doesn't make much sense.  As clever/bad BPF program can then avoid the
> >  MTU check anyhow.  This basically means that we have to do the MTU
> >  check (again) on kernel side anyhow to catch such clever/bad BPF
> >  programs.  (And I don't like wasting cycles on doing the same check two
> >  times).
> > >>>
> > >>> If you get rid of the check in bpf_redirect() you might as well get
> > >>> rid of *all* the checks for excessive mtu in all the helpers that
> > >>> adjust packet size one way or another way.  They *all* then become
> > >>> useless overhead.
> > >>>
> > >>> I don't like that.  There may be something the bpf program could do to
> > >>> react to the error condition (for example in my case, not modify
> > >>> things and just let the core stack deal with things - which will
> > >>> probably just generate packet too big icmp error).
> > >>>
> > >>> btw. right now our forwarding programs first adjust the packet size
> > >>> then call bpf_redirect() and almost immediately return what it
> > >>> returned.
> > >>>
> > >>> but this could I think easily be changed to reverse the ordering, so
> > >>> we wouldn't increase packet size before the core stack was informed we
> > >>> would be forwarding via a different interface.
> > >>
> > >> We do the same, except that we also use XDP_TX when appropriate. This
> > >> complicates the matter, because there is no helper call we could
> > >> return an error from.
> > >
> > > Do notice that my MTU work is focused on TC-BPF.  For XDP-redirect the
> > > MTU check is done in xdp_ok_fwd_dev() via __xdp_enqueue(), which also
> > > happens too late to give BPF-prog knowledge/feedback.  For XDP_TX I
> > > audited the drivers when I implemented xdp_buff.frame_sz, and they
> > > handled (or I added) handling against max HW MTU. E.g. mlx5 [1].
> > >
> > > [1] 
> > > https://elixir.bootlin.com/linux/v5.9-rc6/source/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L267
> > >
> > >> My preference would be to have three helpers: get MTU for a device,
> > >> redirect ctx to a device (with MTU check), resize ctx (without MTU
> > >> check) but that doesn't work with XDP_TX. Your idea of doing checks
> > >> in redirect and adjust_room is pragmatic and seems easier to
> > >> implement.
> > >
> > > I do like this plan/proposal (with 3 helpers), but it is not possible
> > > with current API.  The main problem is the current bpf_redirect API
> > > doesn't provide the ctx, so we cannot do the check in the BPF-helper.
> > >
> > > Are you saying we should create a new bpf_redirect API (that incl packet 
> > > ctx)?
> >
> > Sorry for jumping in late here... one thing that is not clear to me
> > is that if we are fully sure that skb is dropped by stack anyway due
> > to invalid MTU (redirect to ingress does this via dev_forward_skb(),
>
> Yes, TC-redirecting to *INGRESS* have a slightly relaxed MTU check via
> is_skb_forwardable() called via dev_forward_skb().  This MTU check
> seems redundant as netstack will do MTU checks anyhow.
>

I found the MTU check on redirect-to-ingress to be very unexpected.

We hit this when implementing NAT64 as a tc egress program which translates
the packet and redirects it to ingress from the same device.

It is beneficial to have the MTU of the device set to a limit fitting the
IPv4 MTU, so that the IP stack would fragment as needed on the IPv4->IPv6
path. But when translating the packet to IPv6, it can no longer be ingressed
from the same device because of the MTU check. Packets are silently dropped
without any hint.

So would definitely be nice if this check is removed, or a flag to avoid
it is supported in bpf_redirect().

Eyal.


[PATCH ipsec-next 1/2] xfrm interface: avoid xi lookup in xfrmi_decode_session()

2020-07-09 Thread Eyal Birger
The xfrmi context exists in the netdevice priv context.
Avoid looking for it in a separate list.

Signed-off-by: Eyal Birger 
---
 net/xfrm/xfrm_interface.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index c407ecbc5d46..069dafeba873 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -47,6 +47,7 @@ static int xfrmi_dev_init(struct net_device *dev);
 static void xfrmi_dev_setup(struct net_device *dev);
 static struct rtnl_link_ops xfrmi_link_ops __read_mostly;
 static unsigned int xfrmi_net_id __read_mostly;
+static const struct net_device_ops xfrmi_netdev_ops;
 
 struct xfrmi_net {
/* lists for storing interfaces in use */
@@ -73,8 +74,7 @@ static struct xfrm_if *xfrmi_lookup(struct net *net, struct 
xfrm_state *x)
 static struct xfrm_if *xfrmi_decode_session(struct sk_buff *skb,
unsigned short family)
 {
-   struct xfrmi_net *xfrmn;
-   struct xfrm_if *xi;
+   struct net_device *dev;
int ifindex = 0;
 
if (!secpath_exists(skb) || !skb->dev)
@@ -88,18 +88,21 @@ static struct xfrm_if *xfrmi_decode_session(struct sk_buff 
*skb,
ifindex = inet_sdif(skb);
break;
}
-   if (!ifindex)
-   ifindex = skb->dev->ifindex;
 
-   xfrmn = net_generic(xs_net(xfrm_input_state(skb)), xfrmi_net_id);
+   if (ifindex) {
+   struct net *net = xs_net(xfrm_input_state(skb));
 
-   for_each_xfrmi_rcu(xfrmn->xfrmi[0], xi) {
-   if (ifindex == xi->dev->ifindex &&
-   (xi->dev->flags & IFF_UP))
-   return xi;
+   dev = dev_get_by_index_rcu(net, ifindex);
+   } else {
+   dev = skb->dev;
}
 
-   return NULL;
+   if (!dev || !(dev->flags & IFF_UP))
+   return NULL;
+   if (dev->netdev_ops != &xfrmi_netdev_ops)
+   return NULL;
+
+   return netdev_priv(dev);
 }
 
 static void xfrmi_link(struct xfrmi_net *xfrmn, struct xfrm_if *xi)
-- 
2.25.1



[PATCH ipsec-next 0/2] xfrm interface: use hash to store xfrmi contexts

2020-07-09 Thread Eyal Birger
When having many xfrm interfaces, the linear lookup of devices based on
if_id becomes costly.

The first patch refactors xfrmi_decode_session() to use the xi used in
the netdevice priv context instead of looking it up in the list based
on ifindex. This is needed in order to use if_id as the only key used
for xi lookup.

The second patch extends the existing infrastructure - which already
stores the xfrmi contexts in an array of lists - to use a hash of the
if_id.

Example benchmarks:
- running on a KVM based VM
- xfrm tunnel mode between two namespaces
- xfrm interface in one namespace (10.0.0.2)

Before this change set:

Single xfrm interface in namespace:
$ netperf -H 10.0.0.2 -l8 -I95,10 -t TCP_STREAM

MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.2 () 
port 0 AF_INET : +/-5.000% @ 95% conf.  : demo
Recv   SendSend  
Socket Socket  Message  Elapsed  
Size   SizeSize Time Throughput  
bytes  bytes   bytessecs.10^6bits/sec  

131072  16384  163848.00  298.36

After adding 400 xfrmi interfaces in the same namespace:

$ netperf -H 10.0.0.2 -l8 -I95,10 -t TCP_STREAM

MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.2 () 
port 0 AF_INET : +/-5.000% @ 95% conf.  : demo
Recv   SendSend  
Socket Socket  Message  Elapsed  
Size   SizeSize Time Throughput  
bytes  bytes   bytessecs.10^6bits/sec  

131072  16384  163848.00  221.77   

After this patchset there was no observed change after adding the
xfrmi interfaces.

Eyal Birger (2):
  xfrm interface: avoid xi lookup in xfrmi_decode_session()
  xfrm interface: store xfrmi contexts in a hash by if_id

 net/xfrm/xfrm_interface.c | 52 +--
 1 file changed, 33 insertions(+), 19 deletions(-)

-- 
2.25.1



[PATCH ipsec-next 2/2] xfrm interface: store xfrmi contexts in a hash by if_id

2020-07-09 Thread Eyal Birger
xfrmi_lookup() is called on every packet. Using a single list for
looking up if_id becomes a bottleneck when having many xfrm interfaces.

Signed-off-by: Eyal Birger 
---
 net/xfrm/xfrm_interface.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 069dafeba873..f4ec117e9110 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -49,20 +49,28 @@ static struct rtnl_link_ops xfrmi_link_ops __read_mostly;
 static unsigned int xfrmi_net_id __read_mostly;
 static const struct net_device_ops xfrmi_netdev_ops;
 
+#define XFRMI_HASH_BITS8
+#define XFRMI_HASH_SIZEBIT(XFRMI_HASH_BITS)
+
 struct xfrmi_net {
/* lists for storing interfaces in use */
-   struct xfrm_if __rcu *xfrmi[1];
+   struct xfrm_if __rcu *xfrmi[XFRMI_HASH_SIZE];
 };
 
 #define for_each_xfrmi_rcu(start, xi) \
for (xi = rcu_dereference(start); xi; xi = rcu_dereference(xi->next))
 
+static u32 xfrmi_hash(u32 if_id)
+{
+   return hash_32(if_id, XFRMI_HASH_BITS);
+}
+
 static struct xfrm_if *xfrmi_lookup(struct net *net, struct xfrm_state *x)
 {
struct xfrmi_net *xfrmn = net_generic(net, xfrmi_net_id);
struct xfrm_if *xi;
 
-   for_each_xfrmi_rcu(xfrmn->xfrmi[0], xi) {
+   for_each_xfrmi_rcu(xfrmn->xfrmi[xfrmi_hash(x->if_id)], xi) {
if (x->if_id == xi->p.if_id &&
(xi->dev->flags & IFF_UP))
return xi;
@@ -107,7 +115,7 @@ static struct xfrm_if *xfrmi_decode_session(struct sk_buff 
*skb,
 
 static void xfrmi_link(struct xfrmi_net *xfrmn, struct xfrm_if *xi)
 {
-   struct xfrm_if __rcu **xip = &xfrmn->xfrmi[0];
+   struct xfrm_if __rcu **xip = &xfrmn->xfrmi[xfrmi_hash(xi->p.if_id)];
 
rcu_assign_pointer(xi->next , rtnl_dereference(*xip));
rcu_assign_pointer(*xip, xi);
@@ -118,7 +126,7 @@ static void xfrmi_unlink(struct xfrmi_net *xfrmn, struct 
xfrm_if *xi)
struct xfrm_if __rcu **xip;
struct xfrm_if *iter;
 
-   for (xip = &xfrmn->xfrmi[0];
+   for (xip = &xfrmn->xfrmi[xfrmi_hash(xi->p.if_id)];
 (iter = rtnl_dereference(*xip)) != NULL;
 xip = &iter->next) {
if (xi == iter) {
@@ -162,7 +170,7 @@ static struct xfrm_if *xfrmi_locate(struct net *net, struct 
xfrm_if_parms *p)
struct xfrm_if *xi;
struct xfrmi_net *xfrmn = net_generic(net, xfrmi_net_id);
 
-   for (xip = &xfrmn->xfrmi[0];
+   for (xip = &xfrmn->xfrmi[xfrmi_hash(p->if_id)];
 (xi = rtnl_dereference(*xip)) != NULL;
 xip = &xi->next)
if (xi->p.if_id == p->if_id)
@@ -761,11 +769,14 @@ static void __net_exit xfrmi_exit_batch_net(struct 
list_head *net_exit_list)
struct xfrmi_net *xfrmn = net_generic(net, xfrmi_net_id);
struct xfrm_if __rcu **xip;
struct xfrm_if *xi;
+   int i;
 
-   for (xip = &xfrmn->xfrmi[0];
-(xi = rtnl_dereference(*xip)) != NULL;
-xip = &xi->next)
-   unregister_netdevice_queue(xi->dev, &list);
+   for (i = 0; i < XFRMI_HASH_SIZE; i++) {
+   for (xip = &xfrmn->xfrmi[i];
+(xi = rtnl_dereference(*xip)) != NULL;
+xip = &xi->next)
+   unregister_netdevice_queue(xi->dev, &list);
+   }
}
unregister_netdevice_many(&list);
rtnl_unlock();
-- 
2.25.1



[PATCH iproute2,v2 1/2] ip xfrm: update man page on setting/printing XFRMA_IF_ID in states/policies

2020-07-08 Thread Eyal Birger
In commit aed63ae1acb9 ("ip xfrm: support setting/printing XFRMA_IF_ID 
attribute in states/policies")
I added the ability to set/print the xfrm interface ID without updating
the man page.

Fixes: aed63ae1acb9 ("ip xfrm: support setting/printing XFRMA_IF_ID attribute 
in states/policies")
Signed-off-by: Eyal Birger 
---
 man/man8/ip-xfrm.8 | 8 
 1 file changed, 8 insertions(+)

diff --git a/man/man8/ip-xfrm.8 b/man/man8/ip-xfrm.8
index f99f30bb..d717205d 100644
--- a/man/man8/ip-xfrm.8
+++ b/man/man8/ip-xfrm.8
@@ -61,6 +61,8 @@ ip-xfrm \- transform configuration
 .IR EXTRA-FLAG-LIST " ]"
 .RB "[ " output-mark
 .IR OUTPUT-MARK " ]"
+.RB "[ " if_id
+.IR IF-ID " ]"
 
 .ti -8
 .B "ip xfrm state allocspi"
@@ -238,6 +240,8 @@ ip-xfrm \- transform configuration
 .IR PRIORITY " ]"
 .RB "[ " flag
 .IR FLAG-LIST " ]"
+.RB "[ " if_id
+.IR IF-ID " ]"
 .RI "[ " LIMIT-LIST " ] [ " TMPL-LIST " ]"
 
 .ti -8
@@ -561,6 +565,10 @@ used to match xfrm policies and states
 used to set the output mark to influence the routing
 of the packets emitted by the state
 
+.TP
+.I IF-ID
+xfrm interface identifier used to in both xfrm policies and states
+
 .sp
 .PP
 .TS
-- 
2.25.1



[PATCH iproute2,v2 2/2] ip xfrm: policy: support policies with IF_ID in get/delete/deleteall

2020-07-08 Thread Eyal Birger
The XFRMA_IF_ID attribute is set in policies for them to be
associated with an XFRM interface (4.19+).

Add support for getting/deleting policies with this attribute.

For supporting 'deleteall' the XFRMA_IF_ID attribute needs to be
explicitly copied.

Signed-off-by: Eyal Birger 
---
 ip/xfrm_policy.c   | 17 -
 man/man8/ip-xfrm.8 |  2 ++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c
index d3c706d3..7cc00e7c 100644
--- a/ip/xfrm_policy.c
+++ b/ip/xfrm_policy.c
@@ -59,6 +59,7 @@ static void usage(void)
"   [ if_id IF_ID ] [ LIMIT-LIST ] [ TMPL-LIST ]\n"
"Usage: ip xfrm policy { delete | get } { SELECTOR | index 
INDEX } dir DIR\n"
"   [ ctx CTX ] [ mark MARK [ mask MASK ] ] [ ptype PTYPE 
]\n"
+   "   [ if_id IF_ID ]\n"
"Usage: ip xfrm policy { deleteall | list } [ nosock ] [ 
SELECTOR ] [ dir DIR ]\n"
"   [ index INDEX ] [ ptype PTYPE ] [ action ACTION ] [ 
priority PRIORITY ]\n"
"   [ flag FLAG-LIST ]\n"
@@ -582,6 +583,8 @@ static int xfrm_policy_get_or_delete(int argc, char **argv, 
int delete,
struct xfrm_user_sec_ctx sctx;
charstr[CTX_BUF_SIZE];
} ctx = {};
+   bool is_if_id_set = false;
+   __u32 if_id = 0;
 
while (argc > 0) {
if (strcmp(*argv, "dir") == 0) {
@@ -619,7 +622,11 @@ static int xfrm_policy_get_or_delete(int argc, char 
**argv, int delete,
 
NEXT_ARG();
xfrm_policy_ptype_parse(&upt.type, &argc, &argv);
-
+   } else if (strcmp(*argv, "if_id") == 0) {
+   NEXT_ARG();
+   if (get_u32(&if_id, *argv, 0))
+   invarg("IF_ID value is invalid", *argv);
+   is_if_id_set = true;
} else {
if (selp)
invarg("unknown", *argv);
@@ -669,6 +676,9 @@ static int xfrm_policy_get_or_delete(int argc, char **argv, 
int delete,
  (void *)&ctx, ctx.sctx.len);
}
 
+   if (is_if_id_set)
+   addattr32(&req.n, sizeof(req.buf), XFRMA_IF_ID, if_id);
+
if (rtnl_talk(&rth, &req.n, answer) < 0)
exit(2);
 
@@ -767,6 +777,11 @@ static int xfrm_policy_keep(struct nlmsghdr *n, void *arg)
}
}
 
+   if (tb[XFRMA_IF_ID]) {
+   addattr32(new_n, xb->size, XFRMA_IF_ID,
+ rta_getattr_u32(tb[XFRMA_IF_ID]));
+   }
+
xb->offset += new_n->nlmsg_len;
xb->nlmsg_count++;
 
diff --git a/man/man8/ip-xfrm.8 b/man/man8/ip-xfrm.8
index d717205d..aa28db49 100644
--- a/man/man8/ip-xfrm.8
+++ b/man/man8/ip-xfrm.8
@@ -259,6 +259,8 @@ ip-xfrm \- transform configuration
 .IR MASK " ] ]"
 .RB "[ " ptype
 .IR PTYPE " ]"
+.RB "[ " if_id
+.IR IF-ID " ]"
 
 .ti -8
 .BR ip " [ " -4 " | " -6 " ] " "xfrm policy" " { " deleteall " | " list " }"
-- 
2.25.1



[PATCH iproute2,v2 0/2] ip xfrm: policy: support policies with IF_ID in get/delete/deleteall

2020-07-08 Thread Eyal Birger
Allow getting/deleting policies which contain an xfrm interface ID.

First patch fixes the man page with regards to the original addition
of IF-ID in ip xfrm operations.

---

v1 -> v2: update man page

Eyal Birger (2):
  ip xfrm: update man page on setting/printing XFRMA_IF_ID in
states/policies
  ip xfrm: policy: support policies with IF_ID in get/delete/deleteall

 ip/xfrm_policy.c   | 17 -
 man/man8/ip-xfrm.8 | 10 ++
 2 files changed, 26 insertions(+), 1 deletion(-)

-- 
2.25.1



[PATCH iproute2] ip xfrm: policy: support policies with IF_ID in get/delete/deleteall

2020-07-06 Thread Eyal Birger
The XFRMA_IF_ID attribute is set in policies for them to be
associated with an XFRM interface (4.19+).

Add support for getting/deleting policies with this attribute.

For supporting 'deleteall' the XFRMA_IF_ID attribute needs to be
explicitly copied.

Signed-off-by: Eyal Birger 
---
 ip/xfrm_policy.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c
index d3c706d3..7cc00e7c 100644
--- a/ip/xfrm_policy.c
+++ b/ip/xfrm_policy.c
@@ -59,6 +59,7 @@ static void usage(void)
"   [ if_id IF_ID ] [ LIMIT-LIST ] [ TMPL-LIST ]\n"
"Usage: ip xfrm policy { delete | get } { SELECTOR | index 
INDEX } dir DIR\n"
"   [ ctx CTX ] [ mark MARK [ mask MASK ] ] [ ptype PTYPE 
]\n"
+   "   [ if_id IF_ID ]\n"
"Usage: ip xfrm policy { deleteall | list } [ nosock ] [ 
SELECTOR ] [ dir DIR ]\n"
"   [ index INDEX ] [ ptype PTYPE ] [ action ACTION ] [ 
priority PRIORITY ]\n"
"   [ flag FLAG-LIST ]\n"
@@ -582,6 +583,8 @@ static int xfrm_policy_get_or_delete(int argc, char **argv, 
int delete,
struct xfrm_user_sec_ctx sctx;
charstr[CTX_BUF_SIZE];
} ctx = {};
+   bool is_if_id_set = false;
+   __u32 if_id = 0;
 
while (argc > 0) {
if (strcmp(*argv, "dir") == 0) {
@@ -619,7 +622,11 @@ static int xfrm_policy_get_or_delete(int argc, char 
**argv, int delete,
 
NEXT_ARG();
xfrm_policy_ptype_parse(&upt.type, &argc, &argv);
-
+   } else if (strcmp(*argv, "if_id") == 0) {
+   NEXT_ARG();
+   if (get_u32(&if_id, *argv, 0))
+   invarg("IF_ID value is invalid", *argv);
+   is_if_id_set = true;
} else {
if (selp)
invarg("unknown", *argv);
@@ -669,6 +676,9 @@ static int xfrm_policy_get_or_delete(int argc, char **argv, 
int delete,
  (void *)&ctx, ctx.sctx.len);
}
 
+   if (is_if_id_set)
+   addattr32(&req.n, sizeof(req.buf), XFRMA_IF_ID, if_id);
+
if (rtnl_talk(&rth, &req.n, answer) < 0)
exit(2);
 
@@ -767,6 +777,11 @@ static int xfrm_policy_keep(struct nlmsghdr *n, void *arg)
}
}
 
+   if (tb[XFRMA_IF_ID]) {
+   addattr32(new_n, xb->size, XFRMA_IF_ID,
+ rta_getattr_u32(tb[XFRMA_IF_ID]));
+   }
+
xb->offset += new_n->nlmsg_len;
xb->nlmsg_count++;
 
-- 
2.25.1



Re: [PATCH net-next v3 1/4] net: sched: em_ipt: match only on ip/ipv6 traffic

2019-06-27 Thread Eyal Birger
Hi Nik,

On Thu, 27 Jun 2019 11:10:44 +0300
Nikolay Aleksandrov  wrote:

> Restrict matching only to ip/ipv6 traffic and make sure we can use the
> headers, otherwise matches will be attempted on any protocol which can
> be unexpected by the xt matches. Currently policy supports only
> ipv4/6.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
> v3: no change
> v2: no change
> 
>  net/sched/em_ipt.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/net/sched/em_ipt.c b/net/sched/em_ipt.c
> index 243fd22f2248..64dbafe4e94c 100644
> --- a/net/sched/em_ipt.c
> +++ b/net/sched/em_ipt.c
> @@ -185,6 +185,19 @@ static int em_ipt_match(struct sk_buff *skb,
> struct tcf_ematch *em, struct nf_hook_state state;
>   int ret;
>  
> + switch (tc_skb_protocol(skb)) {
> + case htons(ETH_P_IP):
> + if (!pskb_network_may_pull(skb, sizeof(struct
> iphdr)))
> + return 0;
> + break;
> + case htons(ETH_P_IPV6):
> + if (!pskb_network_may_pull(skb, sizeof(struct
> ipv6hdr)))
> + return 0;
> + break;
> + default:
> + return 0;
> + }
> +

I just realized that I didn't consider the egress direction in my review.
Don't we need an skb_pull() in that direction to make the skb->data point
to L3? I see this is done e.g. in em_ipset.

Eyal.


Re: [PATCH net-next v3 0/4] em_ipt: add support for addrtype

2019-06-27 Thread Eyal Birger
On Thu, 27 Jun 2019 11:10:43 +0300
Nikolay Aleksandrov  wrote:

> Hi,
> We would like to be able to use the addrtype from tc for ACL rules and
> em_ipt seems the best place to add support for the already existing xt
> match. The biggest issue is that addrtype revision 1 (with ipv6
> support) is NFPROTO_UNSPEC and currently em_ipt can't differentiate
> between v4/v6 if such xt match is used because it passes the match's
> family instead of the packet one. The first 3 patches make em_ipt
> match only on IP traffic (currently both policy and addrtype
> recognize such traffic only) and make it pass the actual packet's
> protocol instead of the xt match family when it's unspecified. They
> also add support for NFPROTO_UNSPEC xt matches. The last patch allows
> to add addrtype rules via em_ipt. We need to keep the user-specified
> nfproto for dumping in order to be compatible with libxtables, we
> cannot dump NFPROTO_UNSPEC as the nfproto or we'll get an error from
> libxtables, thus the nfproto is limited to ipv4/ipv6 in patch 03 and
> is recorded.
> 
> v3: don't use the user nfproto for matching, only for dumping, more
> information is available in the commit message in patch 03
> v2: change patch 02 to set the nfproto only when unspecified and drop
> patch 04 from v1 (Eyal Birger)
> 
> Thank you,
>   Nikolay Aleksandrov
> 
> 
> Nikolay Aleksandrov (4):
>   net: sched: em_ipt: match only on ip/ipv6 traffic
>   net: sched: em_ipt: set the family based on the packet if it's
> unspecified
>   net: sched: em_ipt: keep the user-specified nfproto and dump it
>   net: sched: em_ipt: add support for addrtype matching
> 
>  net/sched/em_ipt.c | 48
> ++++++++-- 1 file changed, 46
> insertions(+), 2 deletions(-)
> 

Looks great! thanks for adding this!

For the series:

Acked-by: Eyal Birger 


Re: [PATCH net-next 2/5] net: sched: em_ipt: set the family based on the protocol when matching

2019-06-26 Thread Eyal Birger
On Wed, 26 Jun 2019 16:45:28 +0300
Nikolay Aleksandrov  wrote:

> On 26/06/2019 16:33, Eyal Birger wrote:
> > Hi Nikolay,
> >
> > On Wed, 26 Jun 2019 14:58:52 +0300
> > Nikolay Aleksandrov  wrote:
> >   
> >> Set the family based on the protocol otherwise protocol-neutral
> >> matches will have wrong information (e.g. NFPROTO_UNSPEC). In
> >> preparation for using NFPROTO_UNSPEC xt matches.
> >>
> >> Signed-off-by: Nikolay Aleksandrov 
> >> ---
> >>  net/sched/em_ipt.c | 4 +++-
> >>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>
...
> >> -  nf_hook_state_init(&state, im->hook, im->match->family,
> >> +  nf_hook_state_init(&state, im->hook, state.pf,
> >>   indev ?: skb->dev, skb->dev, NULL,
> >> em->net, NULL); 
> >>acpar.match = im->match;  
> > 
> > I think this change is incompatible with current behavior.
> > 
> > Consider the 'policy' match which matches the packet's xfrm state
> > (sec_path) with the provided user space parameters. The sec_path
> > includes information about the encapsulating packet's parameters
> > whereas the current skb points to the encapsulated packet, and the
> > match is done on the encapsulating packet's info.
> > 
> > So if you have an IPv6 packet encapsulated within an IPv4 packet,
> > the match parameters should be done using IPv4 parameters, not IPv6.
> > 
> > Maybe use the packet's family only if the match family is UNSPEC?
> > 
> > Eyal.
> >   
> 
> Hi Eyal,
> I see your point, I was wondering about the xfrm cases. :)
> In such case I think we can simplify the set and do it only on UNSPEC
> matches as you suggest.
> 
> Maybe we should enforce the tc protocol based on the user-specified
> nfproto at least from iproute2 otherwise people can add mismatching
> rules (e.g. nfproto == v6, tc proto == v4).
> 
Hi Nik,

I think for iproute2 the issue is the same. For encapsulated IPv6 in
IPv4 for example, tc proto will be IPv6 (tc sees the encapsulated
packet after decryption) whereas nfproto will be IPv4 (policy match is
done on the encapsulating state metadata which is IPv4).

I think the part missing in iproute2 is the ability to specify
NFPROTO_UNSPEC.

Thanks,
Eyal



Re: [PATCH net-next v2 3/4] net: sched: em_ipt: keep the user-specified nfproto and use it

2019-06-26 Thread Eyal Birger
Hi Nik,

On Wed, 26 Jun 2019 18:56:14 +0300
Nikolay Aleksandrov  wrote:

> For NFPROTO_UNSPEC xt_matches there's no way to restrict the matching
> to a specific family, in order to do so we record the user-specified
> family and later enforce it while doing the match.
> 
> v2: adjust changes to missing patch, was patch 04 in v1
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  net/sched/em_ipt.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
..snip..
> @@ -182,8 +195,8 @@ static int em_ipt_match(struct sk_buff *skb,
> struct tcf_ematch *em, const struct em_ipt_match *im = (const void
> *)em->data; struct xt_action_param acpar = {};
>   struct net_device *indev = NULL;
> - u8 nfproto = im->match->family;
>   struct nf_hook_state state;
> + u8 nfproto = im->nfproto;

Maybe I'm missing something now - but it's not really clear to me now
why keeping im->nfproto would be useful:

If NFPROTO_UNSPEC was provided by userspace then the actual nfproto used
will be taken from the packet, and if NFPROTO_IPV4/IPV6 was specified
from userspace then it will equal im->match->family.

Is there any case where the resulting nfproto would differ as a result
of this patch?

Otherwise the patchset looks excellent to me.

Thanks!
Eyal.



Re: [PATCH net-next 2/5] net: sched: em_ipt: set the family based on the protocol when matching

2019-06-26 Thread Eyal Birger
Hi Nikolay,
   
On Wed, 26 Jun 2019 14:58:52 +0300
Nikolay Aleksandrov  wrote:

> Set the family based on the protocol otherwise protocol-neutral
> matches will have wrong information (e.g. NFPROTO_UNSPEC). In
> preparation for using NFPROTO_UNSPEC xt matches.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  net/sched/em_ipt.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sched/em_ipt.c b/net/sched/em_ipt.c
> index 64dbafe4e94c..23965a071177 100644
> --- a/net/sched/em_ipt.c
> +++ b/net/sched/em_ipt.c
> @@ -189,10 +189,12 @@ static int em_ipt_match(struct sk_buff *skb,
> struct tcf_ematch *em, case htons(ETH_P_IP):
>   if (!pskb_network_may_pull(skb, sizeof(struct
> iphdr))) return 0;
> + state.pf = NFPROTO_IPV4;
>   break;
>   case htons(ETH_P_IPV6):
>   if (!pskb_network_may_pull(skb, sizeof(struct
> ipv6hdr))) return 0;
> + state.pf = NFPROTO_IPV6;
>   break;
>   default:
>   return 0;
> @@ -203,7 +205,7 @@ static int em_ipt_match(struct sk_buff *skb,
> struct tcf_ematch *em, if (skb->skb_iif)
>   indev = dev_get_by_index_rcu(em->net, skb->skb_iif);
>  
> - nf_hook_state_init(&state, im->hook, im->match->family,
> + nf_hook_state_init(&state, im->hook, state.pf,
>  indev ?: skb->dev, skb->dev, NULL,
> em->net, NULL); 
>   acpar.match = im->match;

I think this change is incompatible with current behavior.

Consider the 'policy' match which matches the packet's xfrm state (sec_path)
with the provided user space parameters. The sec_path includes information
about the encapsulating packet's parameters whereas the current skb points to
the encapsulated packet, and the match is done on the encapsulating
packet's info.

So if you have an IPv6 packet encapsulated within an IPv4 packet, the match
parameters should be done using IPv4 parameters, not IPv6.

Maybe use the packet's family only if the match family is UNSPEC?

Eyal.


Re: [PATCH net-next 2/2] net: sched: protect against stack overflow in TC act_mirred

2019-06-25 Thread Eyal Birger
Hi Jamal, John,

On Tue, 25 Jun 2019 07:24:37 -0400
Jamal Hadi Salim  wrote:

> On 2019-06-25 5:06 a.m., John Hurley wrote:
> > On Tue, Jun 25, 2019 at 9:30 AM Eyal Birger 
> > wrote:  
> 
> > I'm not sure on the history of why a value of 4 was selected here
> > but it seems to fall into line with my findings.  
> 
> Back then we could only loop in one direction (as opposed to two right
> now) - so seeing something twice would have been suspect enough,
> so 4 seems to be a good number. I still think 4 is a good number.

I think the introduction of mirred ingress affects the 'seeing something
twice is suspicious' paradigm - see below.

> > Is there a hard requirement for >4 recursive calls here?  
> 
> I think this is where testcases help (which then get permanently
> added in tdc repository). Eyal - if you have a test scenario where
> this could be demonstrated it would help.

I don't have a _hard_ requirement for >4 recursive calls.

I did encounter use cases for 2 layers of stacked net devices using TC
mirred ingress. For example, first layer redirects traffic based on
incoming protocol - e.g. some tunneling criterion - and the second
layer redirects traffic based on the IP packet src/dst, something like:

  +---+  +---+  +---+  +---+
  |ip0|  |ip1|  |ip2|  |ip3|
  +---+  +---+  +---+  +---+
  \  / \   /
   \/   \ /
 +---+ +---+
 |   proto0  | |   proto1  |
 +---+ +---+
\   /
 \ /
+---+
|eth0   |
+---+

Where packets stem from eth0 and are redirected to the appropriate devices
using mirred ingress redirect with different criteria.
This is useful for example when each 'ip' device is part of a different
routing domain.

There are probably many other ways to do this kind of thing, but using mirred
ingress to demux the traffic provides freedom in the demux criteria while
having the benefit of a netdevice at each node allowing to use tcpdump and
other such facilities.

As such, I was concerned that a hard limit of 4 may be restrictive.

I too think Florian's suggestion of using netif_rx() in order to break
the recursion when limit is met (or always use it?) is a good approach
to try in order not to force restrictions while keeping the stack sane.

Eyal.


Re: [PATCH net-next 2/2] net: sched: protect against stack overflow in TC act_mirred

2019-06-25 Thread Eyal Birger
Hi John,

On Mon, 24 Jun 2019 23:13:36 +0100
John Hurley  wrote:

> TC hooks allow the application of filters and actions to packets at
> both ingress and egress of the network stack. It is possible, with
> poor configuration, that this can produce loops whereby an ingress
> hook calls a mirred egress action that has an egress hook that
> redirects back to the first ingress etc. The TC core classifier
> protects against loops when doing reclassifies but there is no
> protection against a packet looping between multiple hooks and
> recursively calling act_mirred. This can lead to stack overflow
> panics.
> 
> Add a per CPU counter to act_mirred that is incremented for each
> recursive call of the action function when processing a packet. If a
> limit is passed then the packet is dropped and CPU counter reset.
> 
> Note that this patch does not protect against loops in TC datapaths.
> Its aim is to prevent stack overflow kernel panics that can be a
> consequence of such loops.
> 
> Signed-off-by: John Hurley 
> Reviewed-by: Simon Horman 
> ---
>  net/sched/act_mirred.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index 8c1d736..c3fce36 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -27,6 +27,9 @@
>  static LIST_HEAD(mirred_list);
>  static DEFINE_SPINLOCK(mirred_list_lock);
>  
> +#define MIRRED_RECURSION_LIMIT4

Could you increase the limit to maybe 6 or 8? I am aware of cases where
mirred ingress is used for cascading several layers of logical network
interfaces and 4 seems a little limiting.

Thanks,
Eyal.


[PATCH iproute2] tc: adjust xtables_match and xtables_target to changes in recent iptables

2019-06-24 Thread Eyal Birger
iptables commit 933400b37d09 ("nft: xtables: add the infrastructure to 
translate from iptables to nft")
added an additional member to struct xtables_match and struct xtables_target.

This change is available for libxtables12 and up.
Add these members conditionally to support both newer and older versions.

Fixes: dd29621578d2 ("tc: add em_ipt ematch for calling xtables matches from tc 
matching context")
Signed-off-by: Eyal Birger 
---
 include/xtables.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/include/xtables.h b/include/xtables.h
index b48c3166..583619f4 100644
--- a/include/xtables.h
+++ b/include/xtables.h
@@ -206,6 +206,24 @@ enum xtables_ext_flags {
XTABLES_EXT_ALIAS = 1 << 0,
 };
 
+#if XTABLES_VERSION_CODE >= 12
+struct xt_xlate;
+
+struct xt_xlate_mt_params {
+   const void  *ip;
+   const struct xt_entry_match *match;
+   int numeric;
+   boolescape_quotes;
+};
+
+struct xt_xlate_tg_params {
+   const void  *ip;
+   const struct xt_entry_target*target;
+   int numeric;
+   boolescape_quotes;
+};
+#endif
+
 /* Include file for additions: new matches and targets. */
 struct xtables_match
 {
@@ -270,6 +288,12 @@ struct xtables_match
void (*x6_fcheck)(struct xt_fcheck_call *);
const struct xt_option_entry *x6_options;
 
+#if XTABLES_VERSION_CODE >= 12
+   /* Translate iptables to nft */
+   int (*xlate)(struct xt_xlate *xl,
+const struct xt_xlate_mt_params *params);
+#endif
+
/* Size of per-extension instance extra "global" scratch space */
size_t udata_size;
 
@@ -347,6 +371,12 @@ struct xtables_target
void (*x6_fcheck)(struct xt_fcheck_call *);
const struct xt_option_entry *x6_options;
 
+#if XTABLES_VERSION_CODE >= 12
+   /* Translate iptables to nft */
+   int (*xlate)(struct xt_xlate *xl,
+const struct xt_xlate_tg_params *params);
+#endif
+
size_t udata_size;
 
/* Ignore these men behind the curtain: */
-- 
2.17.1



Re: [PATCH iproute2-next] ip xfrm: support setting/printing XFRMA_IF_ID attribute in states/policies

2019-04-05 Thread Eyal Birger
Hi Stephen,

On Thu, 4 Apr 2019 10:13:36 -0700
Stephen Hemminger  wrote:

> On Thu,  4 Apr 2019 19:07:38 +0300
> Eyal Birger  wrote:
> 
> > The XFRMA_IF_ID attribute is set in policies/states for them to be
> > associated with an XFRM interface (4.19+).
> > 
> > Add support for setting / displaying this attribute.
> > 
> > Note that 0 is a valid value therefore set XFRMA_IF_ID if any value
> > was provided in command line.
> > 
> > Tested-by: Antony Antony 
> > Signed-off-by: Eyal Birger 
> > ---  
> 
> Since this works with existing kernel int should not need to wait for
> next.

Oh, sorry about that.

The patch applies cleanly and works on the iproute2 tree. Should I
resubmit?

Thanks!
Eyal.


[PATCH iproute2-next] ip xfrm: support setting/printing XFRMA_IF_ID attribute in states/policies

2019-04-04 Thread Eyal Birger
The XFRMA_IF_ID attribute is set in policies/states for them to be
associated with an XFRM interface (4.19+).

Add support for setting / displaying this attribute.

Note that 0 is a valid value therefore set XFRMA_IF_ID if any value
was provided in command line.

Tested-by: Antony Antony 
Signed-off-by: Eyal Birger 
---
 ip/ipxfrm.c  |  8 
 ip/xfrm_policy.c | 12 +++-
 ip/xfrm_state.c  | 11 +++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index b153b863..32f56093 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -891,6 +891,14 @@ void xfrm_xfrma_print(struct rtattr *tb[], __u16 family,
(xuo->flags & XFRM_OFFLOAD_INBOUND) ? "in" : "out");
fprintf(fp, "%s", _SL_);
}
+   if (tb[XFRMA_IF_ID]) {
+   __u32 if_id = rta_getattr_u32(tb[XFRMA_IF_ID]);
+
+   if (prefix)
+   fputs(prefix, fp);
+   fprintf(fp, "if_id %#x", if_id);
+   fprintf(fp, "%s", _SL_);
+   }
 }
 
 static int xfrm_selector_iszero(struct xfrm_selector *s)
diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c
index feccaada..4a63e9ab 100644
--- a/ip/xfrm_policy.c
+++ b/ip/xfrm_policy.c
@@ -55,7 +55,7 @@ static void usage(void)
fprintf(stderr, "Usage: ip xfrm policy { add | update } SELECTOR dir 
DIR [ ctx CTX ]\n");
fprintf(stderr, "[ mark MARK [ mask MASK ] ] [ index INDEX ] [ 
ptype PTYPE ]\n");
fprintf(stderr, "[ action ACTION ] [ priority PRIORITY ] [ flag 
FLAG-LIST ]\n");
-   fprintf(stderr, "[ LIMIT-LIST ] [ TMPL-LIST ]\n");
+   fprintf(stderr, "[ if_id IF_ID ] [ LIMIT-LIST ] [ TMPL-LIST 
]\n");
fprintf(stderr, "Usage: ip xfrm policy { delete | get } { SELECTOR | 
index INDEX } dir DIR\n");
fprintf(stderr, "[ ctx CTX ] [ mark MARK [ mask MASK ] ] [ 
ptype PTYPE ]\n");
fprintf(stderr, "Usage: ip xfrm policy { deleteall | list } [ nosock ] 
[ SELECTOR ] [ dir DIR ]\n");
@@ -270,6 +270,8 @@ static int xfrm_policy_modify(int cmd, unsigned int flags, 
int argc, char **argv
struct xfrm_user_sec_ctx sctx;
charstr[CTX_BUF_SIZE];
} ctx = {};
+   bool is_if_id_set = false;
+   __u32 if_id = 0;
 
while (argc > 0) {
if (strcmp(*argv, "dir") == 0) {
@@ -338,6 +340,11 @@ static int xfrm_policy_modify(int cmd, unsigned int flags, 
int argc, char **argv
xfrm_tmpl_parse(tmpl, &argc, &argv);
 
tmpls_len += sizeof(*tmpl);
+   } else if (strcmp(*argv, "if_id") == 0) {
+   NEXT_ARG();
+   if (get_u32(&if_id, *argv, 0))
+   invarg("IF_ID value is invalid", *argv);
+   is_if_id_set = true;
} else {
if (selp)
duparg("unknown", *argv);
@@ -380,6 +387,9 @@ static int xfrm_policy_modify(int cmd, unsigned int flags, 
int argc, char **argv
  (void *)&ctx, ctx.sctx.len);
}
 
+   if (is_if_id_set)
+   addattr32(&req.n, sizeof(req.buf), XFRMA_IF_ID, if_id);
+
if (rtnl_open_byproto(&rth, 0, NETLINK_XFRM) < 0)
exit(1);
 
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 09292da9..93601437 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -62,6 +62,7 @@ static void usage(void)
fprintf(stderr, "[ coa ADDR[/PLEN] ] [ ctx CTX ] [ extra-flag 
EXTRA-FLAG-LIST ]\n");
fprintf(stderr, "[ offload [dev DEV] dir DIR ]\n");
fprintf(stderr, "[ output-mark OUTPUT-MARK ]\n");
++  fprintf(stderr, "[ if_id IF_ID ]\n");
fprintf(stderr, "Usage: ip xfrm state allocspi ID [ mode MODE ] [ mark 
MARK [ mask MASK ] ]\n");
fprintf(stderr, "[ reqid REQID ] [ seq SEQ ] [ min SPI max SPI 
]\n");
fprintf(stderr, "Usage: ip xfrm state { delete | get } ID [ mark MARK [ 
mask MASK ] ]\n");
@@ -326,6 +327,8 @@ static int xfrm_state_modify(int cmd, unsigned int flags, 
int argc, char **argv)
charstr[CTX_BUF_SIZE];
} ctx = {};
__u32 output_mark = 0;
+   bool is_if_id_set = false;
+   __u32 if_id = 0;
 
while (argc > 0) {
if (strcmp(*argv, "mode") == 0) {
@@ -445,6 +448,11 @@ static int xfrm_state_modify(int cmd, unsigned int flags, 
int argc, char **argv)
NEXT_ARG();
if (get_u32(&output_mark, *argv, 0))
invarg("value after \"o

Re: [PATCH ipsec-next 03/11] xfrm: remove input indirection from xfrm_mode

2019-03-29 Thread Eyal Birger
Hi Florian,

On Wed, Mar 27, 2019 at 7:31 PM Florian Westphal  wrote:
>
> No need for any indirection or abstraction here, both functions
> are pretty much the same and quite small, they also have no external
> dependencies.
>.
> diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
> index ea5ac053c15d..e3c7edd11e77 100644
> --- a/net/xfrm/xfrm_input.c
> +++ b/net/xfrm/xfrm_input.c
> @@ -186,6 +186,74 @@ int xfrm_prepare_input(struct xfrm_state *x, struct 
> sk_buff *skb)
>  }
>  EXPORT_SYMBOL(xfrm_prepare_input);

nit: looks like xfrm_prepare_input() should be static now, no?

Eyal.


Re: [PATCH ipsec, resend 1/1] xfrm: Make set-mark default behavior backward compatible

2019-01-14 Thread Eyal Birger
Hi Benedict,

On Mon, 14 Jan 2019 11:24:38 -0800
Benedict Wong  wrote:

> Fixes 9b42c1f179a6, which changed the default route lookup behavior
> for tunnel mode SAs in the outbound direction to use the skb mark,
> whereas previously mark=0 was used if the output mark was
> unspecified. In mark-based routing schemes such as Android’s, this
> change in default behavior causes routing loops or lookup failures.
> 
> This patch restores the default behavior of using a 0 mark while still
> incorporating the skb mark if the SET_MARK (and SET_MARK_MASK) is
> specified.

It's a little sad that we didn't distinguish between 'no-mark-provided'
and 'mark = mask = 0'. IIUC before this fix, explicitly setting mark,
mask to a) mark = 0, mask = 0x and b) mark = 0, mask = 0 behaved
differently, whereas after this fix they will act the same.

But as it seems there was never support for the (b) scenario and commit
9b42c1f179a6 broke the existing behavior, so I'm ok with this fix.

Thanks!
Eyal.

> 
> Tested with additions to Android's kernel unit test suite:
> https://android-review.googlesource.com/c/kernel/tests/+/860150
> 
> Fixes: 9b42c1f179a6 ("xfrm: Extend the output_mark to support input
> direction and masking") Signed-off-by: Benedict Wong
>  ---
>  net/xfrm/xfrm_policy.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 934492bad8e0..5f574ede1332 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -2600,7 +2600,10 @@ static struct dst_entry
> *xfrm_bundle_create(struct xfrm_policy *policy,
> dst_copy_metrics(dst1, dst); 
>   if (xfrm[i]->props.mode != XFRM_MODE_TRANSPORT) {
> - __u32 mark = xfrm_smark_get(fl->flowi_mark,
> xfrm[i]);
> + __u32 mark = 0;
> +
> + if (xfrm[i]->props.smark.v ||
> xfrm[i]->props.smark.m)
> + mark =
> xfrm_smark_get(fl->flowi_mark, xfrm[i]); 
>   family = xfrm[i]->props.family;
>   dst = xfrm_dst_lookup(xfrm[i], tos,
> fl->flowi_oif,



Re: [iproute2-next] tc: fix xtables incorrect usage of LDFLAGS

2018-12-12 Thread Eyal Birger
On Wed, 12 Dec 2018 19:35:08 +0800
Syrone Wong  wrote:

> The incorrect setting of LDFLAGS causes error below:
> 
> > em_ipt.o: In function `em_ipt_print_epot':
> > em_ipt.c:(.text.em_ipt_print_epot+0x2e): undefined reference to
> > `xtables_init_all'  
> 
> em_ipt.c gets involved when TC_CONFIG_XT=y, which requires xtables,
> while tc/Makefile doesn't pass flags correctly. It adds '-lxtables'
> to LDFLAGS instead of LDLIBS.
> 
> Fixes: dd296215 ("tc: add em_ipt ematch for calling xtables matches
> from tc matching context")
> 
> Signed-off-by: Syrone Wong 

Thanks!

So iiuc there's some other LDFLAG coming after -lxtables on some
platform? sorry about that..

You can add my:

Acked-by: Eyal Birger 

Eyal.


Re: [PATCH net-next 4/4] act_mirred: use ACT_REDIRECT when possible

2018-07-17 Thread Eyal Birger
Hi,

On Mon, 16 Jul 2018 16:39:55 -0700
Cong Wang  wrote:

> On Fri, Jul 13, 2018 at 2:55 AM Paolo Abeni  wrote:
> >
> > When mirred is invoked from the ingress path, and it wants to
> > redirect the processed packet, it can now use the ACT_REDIRECT
> > action, filling the tcf_result accordingly.
> >
> > This avoids a skb_clone() in the TC S/W data path giving a ~10%
> > improvement in forwarding performances. Overall TC S/W performances
> > are now comparable to the kernel openswitch datapath.  
> 
> Avoiding skb_clone() for redirection is cool, but why need to use
> skb_do_redirect() here?
> 
> There is a subtle difference here:
> 
> skb_do_redirect() calls __bpf_rx_skb() which calls
> dev_forward_skb().
> 
> while the current mirred action doesn't scrub packets when
> redirecting to ingress (from egress). Although I forget if it is
> intentionally.
> 
> Also, skb->skb_iif is unset in skb_do_redirect() when
> redirecting to ingress, I recall we have to set it correctly
> for input routing. Probably yet another reason why we
> can't scrub it, unless my memory goes wrong. :)

Also dev_forward_skb() enforces MTU checks on the packet which are not
done at the moment. I am aware of deployments where this would break
things.

Eyal.


Re: [PATCH ipsec-next] xfrm: Allow Set Mark to be Updated Using UPDSA

2018-07-16 Thread Eyal Birger
On Mon, 16 Jul 2018 15:27:26 -0700
Nathan Harold  wrote:

> < re-sent with apologies due to incorrect formatting last
> time... :-( >
> 
> Hi Eyal,
> 
> > If x1 points to a state previously found using
> > __xfrm_state_locate(x), won't __xfrm_state_bump_genids(x1) be
> > equivalent to x1->genid++ in this case?  
> 
> In the vanilla case this is true. IE, if there are no strange/abusive
> uses of the API such as the test below where multiple SAs can match
> the locate().
> 
> > Is it possible that other states will match all of x1 parameters?  
> 
> Yes.  Not sure if it's a bug or a feature, but it's possible for
> multiple SAs to match... for a depressing example, check out
> https://android-review.googlesource.com/c/kernel/tests/+/680958. There
> may be cases where something like this is desired behavior that I'm
> not aware of. Since this is control path, it felt to me like the
> formalism of using the xfrm_state_bump_genids() was worth not possibly
> walking into a different subtle bug later.

Ok. This is indeed depressing and also unexpected.

I wonder if this behavior could be fixed... I'd find it odd if anyone
is relying on being to able to delete a 'no mark' state by supplying
parameters that do include an explicit mark. I have no idea if anyone
is relying on the state insertion order wrt marks - though it would
seem odd to me as well -- obviously such a change is unrelated to this
patch.

I now better understand the need to be cautious.

> 
> > Also, any idea why this isn't needed for other changes in the
> > state?  
> 
> The set_mark (output_mark) is somewhat special because changing this
> mark impacts the routing lookup, which up to now, none of the other
> parameters in the update_sa function do. A new output_mark can and
> will reroute packets to different interfaces. Thus, when we change
> this thing, we want to ensure that we always build a new bundle with a
> new bundle with a new route lookup based on the new set_mark. Since we
> removed the flow cache, things might *incidentally* seem to work right
> now; but, I think that's incidental rather than correct. By bumping
> the genid, we get the dst_entry->check() function to correctly return
> that the dst is obsolete when we call check(). I'm honestly not sure
> what corner cases we could land in if we didn't bump the genid in such
> a case.
> 
> There's definitely a lot going on behind the scenes in this little
> change that I only tenuously grasp, so it's possible that I'm being
> overly cautious in this case. Please let me know your further thoughts
> on whether we need to bump the genid. FYI once this patch is settled,
> I plan to upload a patch to update the xfrm_if_id, which I planned to
> nestle in to this same logic (and with similar, albeit possibly
> more-straightforward rationale).

Thanks so much for the clarification. Indeed there are nuances here and
I appreciate you taking the time to describe them.

FWIW you can add my:

Reviewed-by: Eyal Birger 

Thanks!
Eyal.


Re: [PATCH ipsec-next] xfrm: Allow Set Mark to be Updated Using UPDSA

2018-07-02 Thread Eyal Birger
Hi Nathan,

On Fri, 29 Jun 2018 15:07:10 -0700
Nathan Harold  wrote:

> Allow UPDSA to change "set mark" to permit
> policy separation of packet routing decisions from
> SA keying in systems that use mark-based routing.
> 
> The set mark, used as a routing and firewall mark
> for outbound packets, is made update-able which
> allows routing decisions to be handled independently
> of keying/SA creation. To maintain consistency with
> other optional attributes, the set mark is only
> updated if sent with a non-zero value.
> 
> The per-SA lock and the xfrm_state_lock are taken in
> that order to avoid a deadlock with
> xfrm_timer_handler(), which also takes the locks in
> that order.
> 
> Signed-off-by: Nathan Harold 
> Change-Id: Ia05c6733a94c1901cd1e54eb7c7e237704678d71
> ---
>  net/xfrm/xfrm_state.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index e04a510ec992..c9ffcdfa89f6 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -1562,6 +1562,15 @@ int xfrm_state_update(struct xfrm_state *x)
>   if (x1->curlft.use_time)
>   xfrm_state_check_expire(x1);
>  
> + if (x->props.smark.m || x->props.smark.v) {
> + spin_lock_bh(&net->xfrm.xfrm_state_lock);
> +
> + x1->props.smark = x->props.smark;
> +
> + __xfrm_state_bump_genids(x1);

So I'm trying to wrap my head around this genid thing :)

If x1 points to a state previously found using __xfrm_state_locate(x),
won't __xfrm_state_bump_genids(x1) be equivalent to x1->genid++ in
this case?

Is it possible that other states will match all of x1 parameters?

Also, any idea why this isn't needed for other changes in the state?

Thanks!
Eyal.


Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.

2018-06-14 Thread Eyal Birger



> On 14 Jun 2018, at 15:01, William Tu  wrote:
> 
> Make the printting of bpf xfrm tunnel better and
> cleanup xfrm state and policy when xfrm test finishes.

Yeah the ‘tee’ was useful when developing the test - I could see what’s going 
on :)

Now that it’s in ‘selftests’ it’s definitely better without it.

Thanks for the cleanup!
Eyal.


[PATCH ipsec] vti6: fix PMTU caching and reporting on xmit

2018-06-07 Thread Eyal Birger
When setting the skb->dst before doing the MTU check, the route PMTU
caching and reporting is done on the new dst which is about to be
released.

Instead, PMTU handling should be done using the original dst.

This is aligned with IPv4 VTI.

Signed-off-by: Eyal Birger 
Fixes: ccd740cbc6 ("vti6: Add pmtu handling to vti6_xmit.")
---
 net/ipv6/ip6_vti.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index ca957dd..e675ec7 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -480,10 +480,6 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
goto tx_err_dst_release;
}
 
-   skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev)));
-   skb_dst_set(skb, dst);
-   skb->dev = skb_dst(skb)->dev;
-
mtu = dst_mtu(dst);
if (!skb->ignore_df && skb->len > mtu) {
skb_dst_update_pmtu(skb, mtu);
@@ -498,9 +494,14 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
  htonl(mtu));
}
 
-   return -EMSGSIZE;
+   err = -EMSGSIZE;
+   goto tx_err_dst_release;
}
 
+   skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev)));
+   skb_dst_set(skb, dst);
+   skb->dev = skb_dst(skb)->dev;
+
err = dst_output(t->net, skb->sk, skb);
if (net_xmit_eval(err) == 0) {
struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
-- 
2.7.4



Re: [PATCH ipsec-next] xfrm: Allow Output Mark to be Updated Using UPDSA

2018-05-09 Thread Eyal Birger
Hi Nathan,

On Wed,  9 May 2018 13:46:26 -0700
Nathan Harold  wrote:

> Allow UPDSA to change output_mark to permit
> policy separation of packet routing decisions from
> SA keying in systems that use mark-based routing.
> 
> In the output_mark, used as a routing and firewall
> mark for outbound packets, is made update-able which
> allows routing decisions to be handled independently
> of keying/SA creation. To maintain consistency with
> other optional attributes, the output mark is only
> updated if sent with a non-zero value. Once set, the
> output mark may not be reset to zero, which ensures
> that updating the SA does not require the mark to
> be re-sent to avoid the value being clobbered.

There is an attempt to extend the 'output_mark' to support the input
direction and masking.

In the proposed implementation, output_mark is converted to type 'struct
xfrm_mark' where the semantics are as follows:

- If mark is given by XFRMA_OUTPUT_MARK (renamed to XFRMA_SET_MARK)
  then a new XFRMA_SET_MARK_MASK attribute is consulted to set the mask
  value
- if no XFRMA_SET_MARK_MASK attribute is provided, the mask is set to
  0x

Therefore, if the mask value is 0, we can regard the mark as 'not
given'.

My question is, in the context of this patch, it seems that the
"Once set, the output mark may not be reset to zero" restriction may be
lifted in favor of updating the mark only if the new mask is non zero.

Does this make sense to you?
Eyal


[PATCH bpf-next,v3 1/2] bpf: add helper for getting xfrm states

2018-04-24 Thread Eyal Birger
This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.

Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)

skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions

The helper returns 0 on success. Non zero if no xfrm state at the index
is found - or non exists at all.

struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.

Typical usage:

struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...

Signed-off-by: Eyal Birger 
---
 include/uapi/linux/bpf.h | 25 -
 net/core/filter.c| 48 
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c8383a2..e667939 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -774,6 +774,15 @@ union bpf_attr {
  * @xdp_md: pointer to xdp_md
  * @delta: A negative integer to be added to xdp_md.data_end
  * Return: 0 on success or negative on error
+ *
+ * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
+ * retrieve XFRM state
+ * @skb: pointer to skb
+ * @index: index of the xfrm state in the secpath
+ * @key: pointer to 'struct bpf_xfrm_state'
+ * @size: size of 'struct bpf_xfrm_state'
+ * @flags: room for future extensions
+ * Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -841,7 +850,8 @@ union bpf_attr {
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
FN(bind),   \
-   FN(xdp_adjust_tail),
+   FN(xdp_adjust_tail),\
+   FN(skb_get_xfrm_state),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -947,6 +957,19 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* user accessible mirror of in-kernel xfrm_state.
+ * new fields can only be added to the end of this structure
+ */
+struct bpf_xfrm_state {
+   __u32 reqid;
+   __u32 spi;  /* Stored in network byte order */
+   __u16 family;
+   union {
+   __u32 remote_ipv4;  /* Stored in network byte order */
+   __u32 remote_ipv6[4];   /* Stored in network byte order */
+   };
+};
+
 /* Generic BPF return codes which all BPF program types may support.
  * The values are binary compatible with their TC_ACT_* counter-part to
  * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
diff --git a/net/core/filter.c b/net/core/filter.c
index e25bc4a..8e45c6c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /**
@@ -3743,6 +3744,49 @@ static const struct bpf_func_proto bpf_bind_proto = {
.arg3_type  = ARG_CONST_SIZE,
 };
 
+#ifdef CONFIG_XFRM
+BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index,
+  struct bpf_xfrm_state *, to, u32, size, u64, flags)
+{
+   const struct sec_path *sp = skb_sec_path(skb);
+   const struct xfrm_state *x;
+
+   if (!sp || unlikely(index >= sp->len || flags))
+   goto err_clear;
+
+   x = sp->xvec[index];
+
+   if (unlikely(size != sizeof(struct bpf_xfrm_state)))
+   goto err_clear;
+
+   to->reqid = x->props.reqid;
+   to->spi = x->id.spi;
+   to->family = x->props.family;
+   if (to->family == AF_INET6) {
+   memcpy(to->remote_ipv6, x->props.saddr.a6,
+  sizeof(to->remote_ipv6));
+   } else {
+   to->remote_ipv4 = x->props.saddr.a4;
+   }
+
+   return 0;
+err_clear:
+   memset(to, 0, size);
+   return -EINVAL;
+}
+
+static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
+   .func   = bpf_skb_get_xfrm_state,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_ANYTHING,
+   .arg3_type  = ARG_PTR_TO_UNINIT_MEM,
+   .arg4_type  = ARG_CONST_SIZE,
+   .arg5_type  = ARG_ANYTHING,
+};
+#endif
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3884,6 +3928,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_get_socket_cookie_proto;
case BPF_FUNC_get_socket_ui

[PATCH bpf-next,v3 2/2] samples/bpf: extend test_tunnel_bpf.sh with xfrm state test

2018-04-24 Thread Eyal Birger
Add a test for fetching xfrm state parameters from a tc program running
on ingress.

Signed-off-by: Eyal Birger 
---
 samples/bpf/tcbpf2_kern.c | 16 +++
 samples/bpf/test_tunnel_bpf.sh| 71 +++
 tools/include/uapi/linux/bpf.h| 25 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 4 files changed, 114 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 9a8db7bd..fa260c7 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -593,4 +593,20 @@ int _ip6ip6_get_tunnel(struct __sk_buff *skb)
return TC_ACT_OK;
 }
 
+SEC("xfrm_get_state")
+int _xfrm_get_state(struct __sk_buff *skb)
+{
+   struct bpf_xfrm_state x;
+   char fmt[] = "reqid %d spi 0x%x remote ip 0x%x\n";
+   int ret;
+
+   ret = bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
+   if (ret < 0)
+   return TC_ACT_OK;
+
+   bpf_trace_printk(fmt, sizeof(fmt), x.reqid, bpf_ntohl(x.spi),
+bpf_ntohl(x.remote_ipv4));
+   return TC_ACT_OK;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index c265863..9c534dc 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -155,6 +155,57 @@ function add_ipip_tunnel {
ip addr add dev $DEV 10.1.1.200/24
 }
 
+function setup_xfrm_tunnel {
+   auth=0x$(printf '1%.0s' {1..40})
+   enc=0x$(printf '2%.0s' {1..32})
+   spi_in_to_out=0x1
+   spi_out_to_in=0x2
+   # in namespace
+   # in -> out
+   ip netns exec at_ns0 \
+   ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
+   spi $spi_in_to_out reqid 1 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+   ip netns exec at_ns0 \
+   ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir out \
+   tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
+   mode tunnel
+   # out -> in
+   ip netns exec at_ns0 \
+   ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
+   spi $spi_out_to_in reqid 2 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+   ip netns exec at_ns0 \
+   ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir in \
+   tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
+   mode tunnel
+   # address & route
+   ip netns exec at_ns0 \
+   ip addr add dev veth0 10.1.1.100/32
+   ip netns exec at_ns0 \
+   ip route add 10.1.1.200 dev veth0 via 172.16.1.200 \
+   src 10.1.1.100
+
+   # out of namespace
+   # in -> out
+   ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
+   spi $spi_in_to_out reqid 1 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
+   ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir in \
+   tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
+   mode tunnel
+   # out -> in
+   ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
+   spi $spi_out_to_in reqid 2 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
+   ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir out \
+   tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
+   mode tunnel
+   # address & route
+   ip addr add dev veth1 10.1.1.200/32
+   ip route add 10.1.1.100 dev veth1 via 172.16.1.100 src 10.1.1.200
+}
+
 function attach_bpf {
DEV=$1
SET_TUNNEL=$2
@@ -278,6 +329,22 @@ function test_ipip {
cleanup
 }
 
+function test_xfrm_tunnel {
+   config_device
+tcpdump -nei veth1 ip &
+   output=$(mktemp)
+   cat /sys/kernel/debug/tracing/trace_pipe | tee $output &
+setup_xfrm_tunnel
+   tc qdisc add dev veth1 clsact
+   tc filter add dev veth1 proto ip ingress bpf da obj tcbpf2_kern.o \
+   sec xfrm_get_state
+   ip netns exec at_ns0 ping -c 1 10.1.1.200
+   grep "reqid 1" $output
+   grep "spi 0x1" $output
+   grep "remote ip 0xac100164" $output
+   cleanup
+}
+
 function cleanup {
set +ex
pkill iperf
@@ -291,6 +358,8 @@ function cleanup {
ip link del geneve11
ip link del erspan11
ip link del ip6erspan11
+   ip x s flush
+   ip x p flush
pkill tcpdump
pkill cat
set -ex
@@ -316,4 +385,6 @@ echo "T

[PATCH bpf-next,v3 0/2] bpf: add helper for getting xfrm states

2018-04-24 Thread Eyal Birger
This patchset adds support for fetching XFRM state information from
an eBPF program called from TC.

The first patch introduces a helper for fetching an XFRM state from the
skb's secpath. The XFRM state is modeled using a new virtual struct which
contains the SPI, peer address, and reqid values of the state; This struct
can be extended in the future to provide additional state information.

The second patch adds a test example in test_tunnel_bpf.sh. The sample
validates the correct extraction of state information by the eBPF program.

---
v3:
  - Kept SPI and peer IPv4 address in state in network byte order
following suggestion from Alexei Starovoitov
v2:
  - Fixed two comments by Daniel Borkmann:
- disallow reserved flags in helper call
- avoid compiling in helper code when CONFIG_XFRM is off

Eyal Birger (2):
  bpf: add helper for getting xfrm states
  samples/bpf: extend test_tunnel_bpf.sh with xfrm state test

 include/uapi/linux/bpf.h  | 25 ++-
 net/core/filter.c | 48 +
 samples/bpf/tcbpf2_kern.c | 16 +++
 samples/bpf/test_tunnel_bpf.sh| 71 +++
 tools/include/uapi/linux/bpf.h| 25 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 6 files changed, 186 insertions(+), 3 deletions(-)

-- 
2.7.4



Re: [PATCH bpf-next,v2 1/2] bpf: add helper for getting xfrm states

2018-04-19 Thread Eyal Birger
Hi,

On Wed, 18 Apr 2018 15:31:03 -0700
Alexei Starovoitov  wrote:

> On Thu, Apr 19, 2018 at 12:58:22AM +0300, Eyal Birger wrote:
> > This commit introduces a helper which allows fetching xfrm state
> > parameters by eBPF programs attached to TC.
> > 
> > Prototype:
> > bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> > 
> > skb: pointer to skb
> > index: the index in the skb xfrm_state secpath array
> > xfrm_state: pointer to 'struct bpf_xfrm_state'
> > size: size of 'struct bpf_xfrm_state'
> > flags: reserved for future extensions
> > 


 
> > +#ifdef CONFIG_XFRM
> > +BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32,
> > index,
> > +  struct bpf_xfrm_state *, to, u32, size, u64, flags)
> > +{
> > +   const struct sec_path *sp = skb_sec_path(skb);
> > +   const struct xfrm_state *x;
> > +
> > +   if (!sp || unlikely(index >= sp->len || flags))
> > +   goto err_clear;
> > +
> > +   x = sp->xvec[index];
> > +
> > +   if (unlikely(size != sizeof(struct bpf_xfrm_state)))
> > +   goto err_clear;
> > +
> > +   to->reqid = x->props.reqid;
> > +   to->spi = be32_to_cpu(x->id.spi);
> > +   to->family = x->props.family;
> > +   if (to->family == AF_INET6) {
> > +   memcpy(to->remote_ipv6, x->props.saddr.a6,
> > +  sizeof(to->remote_ipv6));
> > +   } else {
> > +   to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
> > +   }  
> 
> that looks inconsistent. Why v4 is cpu endian, but v6 not?

I agree. I followed the reference in bpf_skb_get_tunnel_key(). 
I can keep v4 in net endianess too.

> Why change endianness of the spi?

I felt it was more consistent with other fields and usually helpful for
programs. I can keep it in network order.

In which case, do you expect it to be typed as __be32 in bpf.h?
(I haven't seen other cases)?

Thanks for your feedback!


Re: [PATCH bpf-next 1/2] bpf: add helper for getting xfrm states

2018-04-18 Thread Eyal Birger
On Wed, 18 Apr 2018 22:59:27 +0200
Daniel Borkmann  wrote:

> On 04/17/2018 06:48 AM, Eyal Birger wrote:
> > This commit introduces a helper which allows fetching xfrm state
> > parameters by eBPF programs attached to TC.
> > 
> > Prototype:
> > bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> > 
> > skb: pointer to skb
> > index: the index in the skb xfrm_state secpath array
> > xfrm_state: pointer to 'struct bpf_xfrm_state'
> > size: size of 'struct bpf_xfrm_state'
> > flags: reserved for future extensions
> > 
> > The helper returns 0 on success. Non zero if no xfrm state at the
> > index is found - or non exists at all.
> > 
> > struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
> > address and the reqid; it can be further extended by adding
> > elements to its end - indicating the populated fields by the 'size'
> > argument - keeping backwards compatibility.
> > 
> > Typical usage:
> > 
> > struct bpf_xfrm_state x = {};
> > bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
> > ...
> > 
> > Signed-off-by: Eyal Birger   
> 
> Patch looks good to me, two comments below:

Thanks! I incorporated your suggestions in v2.
Eyal.

> 
> > ---
> >  include/uapi/linux/bpf.h | 25 -
> >  net/core/filter.c| 46
> > ++ 2 files changed, 70
> > insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index c5ec897..132e172 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -755,6 +755,15 @@ union bpf_attr {
> >   * @addr: pointer to struct sockaddr to bind socket to
> >   * @addr_len: length of sockaddr structure
> >   * Return: 0 on success or negative error code
> > + *
> > + * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> > + * retrieve XFRM state
> > + * @skb: pointer to skb
> > + * @index: index of the xfrm state in the secpath
> > + * @key: pointer to 'struct bpf_xfrm_state'
> > + * @size: size of 'struct bpf_xfrm_state'
> > + * @flags: room for future extensions
> > + * Return: 0 on success or negative error
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)  \
> > FN(unspec), \
> > @@ -821,7 +830,8 @@ union bpf_attr {
> > FN(msg_apply_bytes),\
> > FN(msg_cork_bytes), \
> > FN(msg_pull_data),  \
> > -   FN(bind),
> > +   FN(bind),   \
> > +   FN(skb_get_xfrm_state),
> >  
> >  /* integer value in 'imm' field of BPF_CALL instruction selects
> > which helper
> >   * function eBPF program intends to call
> > @@ -927,6 +937,19 @@ struct bpf_tunnel_key {
> > __u32 tunnel_label;
> >  };
> >  
> > +/* user accessible mirror of in-kernel xfrm_state.
> > + * new fields can only be added to the end of this structure
> > + */
> > +struct bpf_xfrm_state {
> > +   __u32 reqid;
> > +   __u32 spi;
> > +   __u16 family;
> > +   union {
> > +   __u32 remote_ipv4;
> > +   __u32 remote_ipv6[4];
> > +   };
> > +};
> > +
> >  /* Generic BPF return codes which all BPF program types may
> > support.
> >   * The values are binary compatible with their TC_ACT_*
> > counter-part to
> >   * provide backwards compatibility with existing SCHED_CLS and
> > SCHED_ACT diff --git a/net/core/filter.c b/net/core/filter.c
> > index d31aff9..c06600a 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -57,6 +57,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  
> >  /**
> > @@ -3703,6 +3704,49 @@ static const struct bpf_func_proto
> > bpf_bind_proto = { .arg3_type   = ARG_CONST_SIZE,
> >  };
> >  
> > +BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32,
> > index,
> > +  struct bpf_xfrm_state *, to, u32, size, u64, flags)
> > +{
> > +#ifdef CONFIG_XFRM
> > +   const struct sec_path *sp = skb_sec_path(skb);
> > +   const struct xfrm_state *x;
> > +
> > +   if (!sp || index >= sp->len)  
> 
> This should be something like: if (!sp || unlikely(index >= sp->len
> || flags)) Such that we unconditionally bail out on any flags
> currently, since this is reserved for 

[PATCH bpf-next,v2 1/2] bpf: add helper for getting xfrm states

2018-04-18 Thread Eyal Birger
This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.

Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)

skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions

The helper returns 0 on success. Non zero if no xfrm state at the index
is found - or non exists at all.

struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.

Typical usage:

struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...

Signed-off-by: Eyal Birger 
---
 include/uapi/linux/bpf.h | 25 -
 net/core/filter.c| 48 
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9a2d1a0..82b407a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -762,6 +762,15 @@ union bpf_attr {
  * @xdp_md: pointer to xdp_md
  * @delta: A negative integer to be added to xdp_md.data_end
  * Return: 0 on success or negative on error
+ *
+ * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
+ * retrieve XFRM state
+ * @skb: pointer to skb
+ * @index: index of the xfrm state in the secpath
+ * @key: pointer to 'struct bpf_xfrm_state'
+ * @size: size of 'struct bpf_xfrm_state'
+ * @flags: room for future extensions
+ * Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -829,7 +838,8 @@ union bpf_attr {
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
FN(bind),   \
-   FN(xdp_adjust_tail),
+   FN(xdp_adjust_tail),\
+   FN(skb_get_xfrm_state),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -935,6 +945,19 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* user accessible mirror of in-kernel xfrm_state.
+ * new fields can only be added to the end of this structure
+ */
+struct bpf_xfrm_state {
+   __u32 reqid;
+   __u32 spi;
+   __u16 family;
+   union {
+   __u32 remote_ipv4;
+   __u32 remote_ipv6[4];
+   };
+};
+
 /* Generic BPF return codes which all BPF program types may support.
  * The values are binary compatible with their TC_ACT_* counter-part to
  * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
diff --git a/net/core/filter.c b/net/core/filter.c
index 2931859..489d360 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /**
@@ -3749,6 +3750,49 @@ static const struct bpf_func_proto bpf_bind_proto = {
.arg3_type  = ARG_CONST_SIZE,
 };
 
+#ifdef CONFIG_XFRM
+BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index,
+  struct bpf_xfrm_state *, to, u32, size, u64, flags)
+{
+   const struct sec_path *sp = skb_sec_path(skb);
+   const struct xfrm_state *x;
+
+   if (!sp || unlikely(index >= sp->len || flags))
+   goto err_clear;
+
+   x = sp->xvec[index];
+
+   if (unlikely(size != sizeof(struct bpf_xfrm_state)))
+   goto err_clear;
+
+   to->reqid = x->props.reqid;
+   to->spi = be32_to_cpu(x->id.spi);
+   to->family = x->props.family;
+   if (to->family == AF_INET6) {
+   memcpy(to->remote_ipv6, x->props.saddr.a6,
+  sizeof(to->remote_ipv6));
+   } else {
+   to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
+   }
+
+   return 0;
+err_clear:
+   memset(to, 0, size);
+   return -EINVAL;
+}
+
+static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
+   .func   = bpf_skb_get_xfrm_state,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_ANYTHING,
+   .arg3_type  = ARG_PTR_TO_UNINIT_MEM,
+   .arg4_type  = ARG_CONST_SIZE,
+   .arg5_type  = ARG_ANYTHING,
+};
+#endif
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3890,6 +3934,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_get_socket_cookie_proto;
case BPF_FUNC_get_socket_uid:
return &bpf_get_socket_uid_proto;
+#ifdef CONFIG_XFRM
+ 

[PATCH bpf-next,v2 2/2] samples/bpf: extend test_tunnel_bpf.sh with xfrm state test

2018-04-18 Thread Eyal Birger
Add a test for fetching xfrm state parameters from a tc program running
on ingress.

Signed-off-by: Eyal Birger 
---
 samples/bpf/tcbpf2_kern.c | 15 +++
 samples/bpf/test_tunnel_bpf.sh| 71 +++
 tools/include/uapi/linux/bpf.h| 25 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 4 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 9a8db7bd..3303803 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -593,4 +593,19 @@ int _ip6ip6_get_tunnel(struct __sk_buff *skb)
return TC_ACT_OK;
 }
 
+SEC("xfrm_get_state")
+int _xfrm_get_state(struct __sk_buff *skb)
+{
+   struct bpf_xfrm_state x;
+   char fmt[] = "reqid %d spi 0x%x remote ip 0x%x\n";
+   int ret;
+
+   ret = bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
+   if (ret < 0)
+   return TC_ACT_OK;
+
+   bpf_trace_printk(fmt, sizeof(fmt), x.reqid, x.spi, x.remote_ipv4);
+   return TC_ACT_OK;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index c265863..9c534dc 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -155,6 +155,57 @@ function add_ipip_tunnel {
ip addr add dev $DEV 10.1.1.200/24
 }
 
+function setup_xfrm_tunnel {
+   auth=0x$(printf '1%.0s' {1..40})
+   enc=0x$(printf '2%.0s' {1..32})
+   spi_in_to_out=0x1
+   spi_out_to_in=0x2
+   # in namespace
+   # in -> out
+   ip netns exec at_ns0 \
+   ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
+   spi $spi_in_to_out reqid 1 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+   ip netns exec at_ns0 \
+   ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir out \
+   tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
+   mode tunnel
+   # out -> in
+   ip netns exec at_ns0 \
+   ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
+   spi $spi_out_to_in reqid 2 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+   ip netns exec at_ns0 \
+   ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir in \
+   tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
+   mode tunnel
+   # address & route
+   ip netns exec at_ns0 \
+   ip addr add dev veth0 10.1.1.100/32
+   ip netns exec at_ns0 \
+   ip route add 10.1.1.200 dev veth0 via 172.16.1.200 \
+   src 10.1.1.100
+
+   # out of namespace
+   # in -> out
+   ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
+   spi $spi_in_to_out reqid 1 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
+   ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir in \
+   tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
+   mode tunnel
+   # out -> in
+   ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
+   spi $spi_out_to_in reqid 2 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
+   ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir out \
+   tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
+   mode tunnel
+   # address & route
+   ip addr add dev veth1 10.1.1.200/32
+   ip route add 10.1.1.100 dev veth1 via 172.16.1.100 src 10.1.1.200
+}
+
 function attach_bpf {
DEV=$1
SET_TUNNEL=$2
@@ -278,6 +329,22 @@ function test_ipip {
cleanup
 }
 
+function test_xfrm_tunnel {
+   config_device
+tcpdump -nei veth1 ip &
+   output=$(mktemp)
+   cat /sys/kernel/debug/tracing/trace_pipe | tee $output &
+setup_xfrm_tunnel
+   tc qdisc add dev veth1 clsact
+   tc filter add dev veth1 proto ip ingress bpf da obj tcbpf2_kern.o \
+   sec xfrm_get_state
+   ip netns exec at_ns0 ping -c 1 10.1.1.200
+   grep "reqid 1" $output
+   grep "spi 0x1" $output
+   grep "remote ip 0xac100164" $output
+   cleanup
+}
+
 function cleanup {
set +ex
pkill iperf
@@ -291,6 +358,8 @@ function cleanup {
ip link del geneve11
ip link del erspan11
ip link del ip6erspan11
+   ip x s flush
+   ip x p flush
pkill tcpdump
pkill cat
set -ex
@@ -316,4 +385,6 @@ echo "Testing GENEVE tunnel..."
 test_geneve
 ech

[PATCH bpf-next,v2 0/2] bpf: add helper for getting xfrm states

2018-04-18 Thread Eyal Birger
This patchset adds support for fetching XFRM state information from
an eBPF program called from TC.

The first patch introduces a helper for fetching an XFRM state from the
skb's secpath. The XFRM state is modeled using a new virtual struct which
contains the SPI, peer address, and reqid values of the state; This struct
can be extended in the future to provide additional state information.

The second patch adds a test example in test_tunnel_bpf.sh. The sample
validates the correct extraction of state information by the eBPF program.

---
v2:
  - Fixed two comments by Daniel Borkmann:
- disallow reserved flags in helper call
- avoid compiling in helper code when CONFIG_XFRM is off

Eyal Birger (2):
  bpf: add helper for getting xfrm states
  samples/bpf: extend test_tunnel_bpf.sh with xfrm state test

 include/uapi/linux/bpf.h  | 25 ++-
 net/core/filter.c | 48 +
 samples/bpf/tcbpf2_kern.c | 15 +++
 samples/bpf/test_tunnel_bpf.sh| 71 +++
 tools/include/uapi/linux/bpf.h| 25 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 6 files changed, 185 insertions(+), 3 deletions(-)

-- 
2.7.4



[PATCH bpf-next 1/2] bpf: add helper for getting xfrm states

2018-04-16 Thread Eyal Birger
This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.

Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)

skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions

The helper returns 0 on success. Non zero if no xfrm state at the index
is found - or non exists at all.

struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.

Typical usage:

struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...

Signed-off-by: Eyal Birger 
---
 include/uapi/linux/bpf.h | 25 -
 net/core/filter.c| 46 ++
 2 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec897..132e172 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -755,6 +755,15 @@ union bpf_attr {
  * @addr: pointer to struct sockaddr to bind socket to
  * @addr_len: length of sockaddr structure
  * Return: 0 on success or negative error code
+ *
+ * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
+ * retrieve XFRM state
+ * @skb: pointer to skb
+ * @index: index of the xfrm state in the secpath
+ * @key: pointer to 'struct bpf_xfrm_state'
+ * @size: size of 'struct bpf_xfrm_state'
+ * @flags: room for future extensions
+ * Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -821,7 +830,8 @@ union bpf_attr {
FN(msg_apply_bytes),\
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
-   FN(bind),
+   FN(bind),   \
+   FN(skb_get_xfrm_state),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -927,6 +937,19 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* user accessible mirror of in-kernel xfrm_state.
+ * new fields can only be added to the end of this structure
+ */
+struct bpf_xfrm_state {
+   __u32 reqid;
+   __u32 spi;
+   __u16 family;
+   union {
+   __u32 remote_ipv4;
+   __u32 remote_ipv6[4];
+   };
+};
+
 /* Generic BPF return codes which all BPF program types may support.
  * The values are binary compatible with their TC_ACT_* counter-part to
  * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
diff --git a/net/core/filter.c b/net/core/filter.c
index d31aff9..c06600a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /**
@@ -3703,6 +3704,49 @@ static const struct bpf_func_proto bpf_bind_proto = {
.arg3_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index,
+  struct bpf_xfrm_state *, to, u32, size, u64, flags)
+{
+#ifdef CONFIG_XFRM
+   const struct sec_path *sp = skb_sec_path(skb);
+   const struct xfrm_state *x;
+
+   if (!sp || index >= sp->len)
+   goto err_clear;
+
+   x = sp->xvec[index];
+
+   if (unlikely(size != sizeof(struct bpf_xfrm_state)))
+   goto err_clear;
+
+   to->reqid = x->props.reqid;
+   to->spi = be32_to_cpu(x->id.spi);
+   to->family = x->props.family;
+   if (to->family == AF_INET6) {
+   memcpy(to->remote_ipv6, x->props.saddr.a6,
+  sizeof(to->remote_ipv6));
+   } else {
+   to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
+   }
+
+   return 0;
+err_clear:
+#endif
+   memset(to, 0, size);
+   return -EINVAL;
+}
+
+static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
+   .func   = bpf_skb_get_xfrm_state,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_ANYTHING,
+   .arg3_type  = ARG_PTR_TO_UNINIT_MEM,
+   .arg4_type  = ARG_CONST_SIZE,
+   .arg5_type  = ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3844,6 +3888,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_get_socket_cookie_proto;
case BPF_FUNC_get_socket_uid:
return &bpf_get_socket_uid_proto;
+   case BPF_FUNC_skb_get_xfrm_state:
+   

[PATCH bpf-next 2/2] samples/bpf: extend test_tunnel_bpf.sh with xfrm state test

2018-04-16 Thread Eyal Birger
Add a test for fetching xfrm state parameters from a tc program running
on ingress.

Signed-off-by: Eyal Birger 
---
 samples/bpf/tcbpf2_kern.c | 15 +++
 samples/bpf/test_tunnel_bpf.sh| 71 +++
 tools/include/uapi/linux/bpf.h| 25 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
 4 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 9a8db7bd..3303803 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -593,4 +593,19 @@ int _ip6ip6_get_tunnel(struct __sk_buff *skb)
return TC_ACT_OK;
 }
 
+SEC("xfrm_get_state")
+int _xfrm_get_state(struct __sk_buff *skb)
+{
+   struct bpf_xfrm_state x;
+   char fmt[] = "reqid %d spi 0x%x remote ip 0x%x\n";
+   int ret;
+
+   ret = bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
+   if (ret < 0)
+   return TC_ACT_OK;
+
+   bpf_trace_printk(fmt, sizeof(fmt), x.reqid, x.spi, x.remote_ipv4);
+   return TC_ACT_OK;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index c265863..9c534dc 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -155,6 +155,57 @@ function add_ipip_tunnel {
ip addr add dev $DEV 10.1.1.200/24
 }
 
+function setup_xfrm_tunnel {
+   auth=0x$(printf '1%.0s' {1..40})
+   enc=0x$(printf '2%.0s' {1..32})
+   spi_in_to_out=0x1
+   spi_out_to_in=0x2
+   # in namespace
+   # in -> out
+   ip netns exec at_ns0 \
+   ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
+   spi $spi_in_to_out reqid 1 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+   ip netns exec at_ns0 \
+   ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir out \
+   tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
+   mode tunnel
+   # out -> in
+   ip netns exec at_ns0 \
+   ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
+   spi $spi_out_to_in reqid 2 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+   ip netns exec at_ns0 \
+   ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir in \
+   tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
+   mode tunnel
+   # address & route
+   ip netns exec at_ns0 \
+   ip addr add dev veth0 10.1.1.100/32
+   ip netns exec at_ns0 \
+   ip route add 10.1.1.200 dev veth0 via 172.16.1.200 \
+   src 10.1.1.100
+
+   # out of namespace
+   # in -> out
+   ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
+   spi $spi_in_to_out reqid 1 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
+   ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir in \
+   tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
+   mode tunnel
+   # out -> in
+   ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
+   spi $spi_out_to_in reqid 2 mode tunnel \
+   auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
+   ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir out \
+   tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
+   mode tunnel
+   # address & route
+   ip addr add dev veth1 10.1.1.200/32
+   ip route add 10.1.1.100 dev veth1 via 172.16.1.100 src 10.1.1.200
+}
+
 function attach_bpf {
DEV=$1
SET_TUNNEL=$2
@@ -278,6 +329,22 @@ function test_ipip {
cleanup
 }
 
+function test_xfrm_tunnel {
+   config_device
+tcpdump -nei veth1 ip &
+   output=$(mktemp)
+   cat /sys/kernel/debug/tracing/trace_pipe | tee $output &
+setup_xfrm_tunnel
+   tc qdisc add dev veth1 clsact
+   tc filter add dev veth1 proto ip ingress bpf da obj tcbpf2_kern.o \
+   sec xfrm_get_state
+   ip netns exec at_ns0 ping -c 1 10.1.1.200
+   grep "reqid 1" $output
+   grep "spi 0x1" $output
+   grep "remote ip 0xac100164" $output
+   cleanup
+}
+
 function cleanup {
set +ex
pkill iperf
@@ -291,6 +358,8 @@ function cleanup {
ip link del geneve11
ip link del erspan11
ip link del ip6erspan11
+   ip x s flush
+   ip x p flush
pkill tcpdump
pkill cat
set -ex
@@ -316,4 +385,6 @@ echo "Testing GENEVE tunnel..."
 test_geneve
 ech

[PATCH bpf-next 0/2] bpf: add helper for getting xfrm states

2018-04-16 Thread Eyal Birger
This patchset adds support for fetching XFRM state information from
an eBPF program called from TC.

The first patch introduces a helper for fetching an XFRM state from the
skb's secpath. The XFRM state is modeled using a new virtual struct which
contains the SPI, peer address, and reqid values of the state; This struct
can be extended in the future to provide additional state information.

The second patch adds a test example in test_tunnel_bpf.sh. The sample
validates the correct extraction of state information by the eBPF program.

---


Eyal Birger (2):
  bpf: add helper for getting xfrm states
  samples/bpf: extend test_tunnel_bpf.sh with xfrm state test

 include/uapi/linux/bpf.h  | 25 ++-
 net/core/filter.c | 46 
 samples/bpf/tcbpf2_kern.c | 15 +++
 samples/bpf/test_tunnel_bpf.sh| 71 +++
 tools/include/uapi/linux/bpf.h| 25 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
 6 files changed, 183 insertions(+), 2 deletions(-)

-- 
2.7.4



[PATCH iproute2-next 0/2] tc: add ipt ematch

2018-02-23 Thread Eyal Birger
This patchset extends tc to support the ipt ematch.

The first patch adds the ability for ematch cmdline parsers
to receive argc,argv parameters.
The second patch adds the em_ipt module.

Eyal Birger (2):
  tc: ematch: add parse_eopt_argv() method for providing ematches with
argv parameters
  tc: add em_ipt ematch for calling xtables matches from tc matching
context

 etc/iproute2/ematch_map |   1 +
 man/man8/tc-ematch.8|  15 
 tc/Makefile |   7 ++
 tc/em_ipt.c | 208 
 tc/m_ematch.c   |  27 ++-
 tc/m_ematch.h   |   2 +
 6 files changed, 259 insertions(+), 1 deletion(-)
 create mode 100644 tc/em_ipt.c

-- 
2.7.4



[PATCH iproute2-next 1/2] tc: ematch: add parse_eopt_argv() method for providing ematches with argv parameters

2018-02-23 Thread Eyal Birger
ematche uses YACC to parse ematch arguments and places them in struct bstr
linked lists.

It is useful to be able to receive parameters as argc,argv in order to use
getopt (and alike) argument parsers.

Signed-off-by: Eyal Birger 
---
 tc/m_ematch.c | 27 ++-
 tc/m_ematch.h |  2 ++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/tc/m_ematch.c b/tc/m_ematch.c
index 7dbc2e7..61f8fb3 100644
--- a/tc/m_ematch.c
+++ b/tc/m_ematch.c
@@ -169,6 +169,31 @@ static struct ematch_util *get_ematch_kind_num(__u16 kind)
return get_ematch_kind(name);
 }
 
+static int em_parse_call(struct nlmsghdr *n, struct tcf_ematch_hdr *hdr,
+struct ematch_util *e, struct ematch *t)
+{
+   if (e->parse_eopt_argv) {
+   int argc = 0, i = 0, ret;
+   struct bstr *args;
+   char **argv;
+
+   for (args = t->args; args; args = bstr_next(args))
+   argc++;
+   argv = calloc(argc, sizeof(char *));
+   if (!argv)
+   return -1;
+   for (args = t->args; args; args = bstr_next(args))
+   argv[i++] = args->data;
+
+   ret = e->parse_eopt_argv(n, hdr, argc, argv);
+
+   free(argv);
+   return ret;
+   }
+
+   return e->parse_eopt(n, hdr, t->args->next);
+}
+
 static int parse_tree(struct nlmsghdr *n, struct ematch *tree)
 {
int index = 1;
@@ -212,7 +237,7 @@ static int parse_tree(struct nlmsghdr *n, struct ematch 
*tree)
}
 
hdr.kind = num;
-   if (e->parse_eopt(n, &hdr, t->args->next) < 0)
+   if (em_parse_call(n, &hdr, e, t) < 0)
return -1;
}
 
diff --git a/tc/m_ematch.h b/tc/m_ematch.h
index fa6e214..f634f19 100644
--- a/tc/m_ematch.h
+++ b/tc/m_ematch.h
@@ -88,6 +88,8 @@ struct ematch_util
int kind_num;
int (*parse_eopt)(struct nlmsghdr *,struct tcf_ematch_hdr *,
  struct bstr *);
+   int (*parse_eopt_argv)(struct nlmsghdr *, struct tcf_ematch_hdr *,
+  int, char **);
int (*print_eopt)(FILE *, struct tcf_ematch_hdr *, void *, int);
void(*print_usage)(FILE *);
struct ematch_util  *next;
-- 
2.7.4



[PATCH iproute2-next 2/2] tc: add em_ipt ematch for calling xtables matches from tc matching context

2018-02-23 Thread Eyal Birger
The commit calls a new tc ematch for using netfilter xtable matches.

This allows early classification as well as mirroning/redirecting traffic
based on logic implemented in netfilter extensions.

Current supported use case is classification based on the incoming IPSec
state used during decpsulation using the 'policy' iptables extension
(xt_policy).

The matcher uses libxtables for parsing the input parameters.

Example use for matching an IPSec state with reqid 1:

tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent : \
basic match 'ipt(-m policy --dir in --pol ipsec --reqid 1)' \
action drop

This is the user-space counter part of kernel commit ccc007e4a746
("net: sched: add em_ipt ematch for calling xtables matches")

Signed-off-by: Eyal Birger 
---
 etc/iproute2/ematch_map |   1 +
 man/man8/tc-ematch.8|  15 
 tc/Makefile |   7 ++
 tc/em_ipt.c | 208 
 4 files changed, 231 insertions(+)
 create mode 100644 tc/em_ipt.c

diff --git a/etc/iproute2/ematch_map b/etc/iproute2/ematch_map
index 1823983..4d6bb2f 100644
--- a/etc/iproute2/ematch_map
+++ b/etc/iproute2/ematch_map
@@ -5,3 +5,4 @@
 4  meta
 7  canid
 8  ipset
+9  ipt
diff --git a/man/man8/tc-ematch.8 b/man/man8/tc-ematch.8
index b9bf70c..042f840 100644
--- a/man/man8/tc-ematch.8
+++ b/man/man8/tc-ematch.8
@@ -98,6 +98,17 @@ When using the ipset ematch with the "ip_set_hash:net,iface" 
set type,
 the interface can be queried using "src,dst (source ip address, outgoing 
interface) or
 "src,src" (source ip address, incoming interface) syntax.
 
+.SS ipt
+test packet against xtables matches
+
+.IR ipt "( " [-6] " "-m " " MATCH_NAME " " FLAGS " )
+
+.IR MATCH_NAME " := " string
+
+.IR FLAGS " := { " FLAG " [, " FLAGS "] }
+
+The flag options are the same as those used by the xtable match used.
+
 .SH CAVEATS
 
 The ematch syntax uses '(' and ')' to group expressions. All braces need to be
@@ -127,6 +138,10 @@ Check if packet source ip and the interface the packet 
arrived on is member of "
 
 # 'ipset(interactive src,src)'
 
+Check if packet matches an IPSec state with reqid 1:
+
+# 'ipt(-m policy --dir in --pol ipsec --reqid 1)'
+
 .SH "AUTHOR"
 
 The extended match infrastructure was added by Thomas Graf.
diff --git a/tc/Makefile b/tc/Makefile
index 3716dd6..dfd0026 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -80,6 +80,7 @@ endif
 ifneq ($(TC_CONFIG_NO_XT),y)
   ifeq ($(TC_CONFIG_XT),y)
 TCSO += m_xt.so
+TCMODULES += em_ipt.o
 ifeq ($(TC_CONFIG_IPSET),y)
   TCMODULES += em_ipset.o
 endif
@@ -163,6 +164,12 @@ m_xt_old.so: m_xt_old.c
 
 em_ipset.o: CFLAGS += $$($(PKG_CONFIG) xtables --cflags)
 
+em_ipt.o: CFLAGS += $$($(PKG_CONFIG) xtables --cflags)
+
+ifeq ($(TC_CONFIG_XT),y)
+  LDFLAGS += $$($(PKG_CONFIG) xtables --libs)
+endif
+
 %.yacc.c: %.y
$(QUIET_YACC)$(YACC) $(YACCFLAGS) -o $@ $<
 
diff --git a/tc/em_ipt.c b/tc/em_ipt.c
new file mode 100644
index 000..9d2b2f9
--- /dev/null
+++ b/tc/em_ipt.c
@@ -0,0 +1,208 @@
+/*
+ * em_ipt.cIPtables extenstions matching Ematch
+ *
+ * (C) 2018 Eyal Birger 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+#include "m_ematch.h"
+
+static void em_ipt_print_usage(FILE *fd)
+{
+   fprintf(fd,
+   "Usage: ipt([-6] -m MATCH_NAME [MATCH_OPTS])\n"
+   "Example: 'ipt(-m policy --reqid 1 --pol ipsec --dir in)'\n");
+}
+
+static struct option original_opts[] = {
+   {
+   .name = "match",
+   .has_arg = 1,
+   .val = 'm'
+   },
+   {
+   .name = "ipv6",
+   .val = '6'
+   },
+   {}
+};
+
+static struct xtables_globals em_tc_ipt_globals = {
+   .option_offset = 0,
+   .program_name = "tc-em-ipt",
+   .program_version = "0.1",
+   .orig_opts = original_opts,
+   .opts = original_opts,
+   .compat_rev = xtables_compatible_revision,
+};
+
+static struct xt_entry_match *fake_xt_entry_match(int data_size, void *data)
+{
+   struct xt_entry_match *m;
+
+   m = xtables_calloc(1, XT_ALIGN(sizeof(*m)) + data_size);
+   if (!m)
+   return NULL;
+
+   if (data)
+   memcpy(m->data, data, data_size);
+
+   m->u.user.match_size = data_size;
+   return m;
+}
+
+static void scrub_match(struct xtables_match *match)
+{
+   match->mflags = 0;
+   free(match->m);
+   match->m = NU

Re: [PATCH net-next 1/2] net: netfilter: export xt_policy match_policy_in() as xt_policy_match_policy_in()

2018-02-15 Thread Eyal Birger
Hi Pablo,

On Wed, 14 Feb 2018 11:19:40 +0100
Pablo Neira Ayuso  wrote:

> On Wed, Feb 14, 2018 at 10:14:24AM +0200, Eyal Birger wrote:
> > Hi Pablo,
> > 
> > On Mon, 15 Jan 2018 13:48:41 +0200
> > Eyal Birger  wrote:
> >   
> > > On Mon, Jan 15, 2018 at 12:57 PM, Pablo Neira Ayuso
> > >  wrote:  
> > > > On Sun, Jan 14, 2018 at 02:47:46PM +0200, Eyal Birger wrote:
> > > >> On Fri, Jan 12, 2018 at 4:00 PM, Pablo Neira Ayuso
> > > >>  wrote:
> > > >> > On Fri, Jan 12, 2018 at 03:56:21PM +0200, Eyal Birger
> > > >> > wrote:
> > > >> >> On Fri, Jan 12, 2018 at 3:41 PM, Pablo Neira Ayuso
> > > >> >>  wrote:
> > > >> >> > On Fri, Jan 12, 2018 at 02:57:24PM +0200, Eyal Birger
> > > >> >> > wrote:
> > > >> >> >> @@ -51,9 +52,9 @@ match_xfrm_state(const struct
> > > >> >> >> xfrm_state *x, const struct xt_policy_elem *e,
> > > >> >> >> MATCH(reqid, x->props.reqid); }
> > > >> >> >>
> > > >> >> >> -static int
> > > >> >> >> -match_policy_in(const struct sk_buff *skb, const struct
> > > >> >> >> xt_policy_info *info,
> > > >> >> >> - unsigned short family)
> > > >> >> >> +int xt_policy_match_policy_in(const struct sk_buff *skb,
> > > >> >> >> +   const struct xt_policy_info
> > > >> >> >> *info,
> > > >> >> >> +   unsigned short family)
> > > >> >> >>  {
> > > >> >> >>   const struct xt_policy_elem *e;
> > > >> >> >>   const struct sec_path *sp = skb->sp;
> > > >> >> >> @@ -80,10 +81,11 @@ match_policy_in(const struct sk_buff
> > > >> >> >> *skb, const struct xt_policy_info *info,
> > > >> >> >>
> > > >> >> >>   return strict ? 1 : 0;
> > > >> >> >>  }
> > > >> >> >> +EXPORT_SYMBOL_GPL(xt_policy_match_policy_in);
> > > >> >> >
> > > >> >> > If you just want to call xt_policy_match from tc, then you
> > > >> >> > could use tc ipt infrastructure instead.
> > > >> >>
> > > >> >> Thanks for the suggestion -
> > > >> >> Are you referring to act_ipt? it looks like it allows
> > > >> >> calling targets; I couldn't find a classifier calling a
> > > >> >> netfilter matcher.
> > > >> >
> > > >> > Then, I'd suggest you extend that infrastructure to alllow to
> > > >> > call matches, so we reduce the number of interdepencies
> > > >> > between different subsystems.
> > > >>
> > > >> This appears very versatile. though in this case the use of the
> > > >> xtables code and structures was done in order to avoid
> > > >> introducing new uapi structures and supporting
> > > >> match code, not necessarily to expose the full capabilities of
> > > >> extended matches, similar in spirit to what was done in the
> > > >> em_ipset ematch.
> > > >>
> > > >> Perhaps in order to avoid the direct export of xt_policy code,
> > > >> I could call xt_request_find_match() from the em_policy module,
> > > >> requesting the xt_policy match?
> > > >> this way api exposure is minimized while not overly
> > > >> complicating the scope of this feature.
> > > >>
> > > >> What do you think?
> > > >
> > > > That would look better indeed.
> > > >
> > > > But once you call xt_request_find_match() from there, how far
> > > > is to allow any arbitrary match? I think you only have to
> > > > specify the match name, family and the binary layout structure
> > > > that represents xt_policy, right?
> > > >
> > > 
> > > I don't think that should be a problem. I'd need to pass the
> > > protocol onto the ematches .change() callbacks and get the
> > > appropriate match from there.
> > >   
> > > > I'm telling this, because I think it wou

[PATCH net-next,v3] net: sched: add em_ipt ematch for calling xtables matches

2018-02-15 Thread Eyal Birger
The commit a new tc ematch for using netfilter xtable matches.

This allows early classification as well as mirroning/redirecting traffic
based on logic implemented in netfilter extensions.

Current supported use case is classification based on the incoming IPSec
state used during decpsulation using the 'policy' iptables extension
(xt_policy).

The module dynamically fetches the netfilter match module and calls
it using a fake xt_action_param structure based on validated userspace
provided parameters.

As the xt_policy match does not access skb->data, no skb modifications
are needed on match.

Signed-off-by: Eyal Birger 

---

v3:
  - limit supported match to xt_policy and validate parameters
  - receive match protocol from userspace
v2:
  - Remove skb push/pull and limit functionality to ingress
---
 include/uapi/linux/pkt_cls.h |   3 +-
 include/uapi/linux/tc_ematch/tc_em_ipt.h |  20 +++
 net/sched/Kconfig|  12 ++
 net/sched/Makefile   |   1 +
 net/sched/em_ipt.c   | 257 +++
 5 files changed, 292 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/tc_ematch/tc_em_ipt.h
 create mode 100644 net/sched/em_ipt.c

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 46c5066..7cafb26d 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -555,7 +555,8 @@ enum {
 #defineTCF_EM_VLAN 6
 #defineTCF_EM_CANID7
 #defineTCF_EM_IPSET8
-#defineTCF_EM_MAX  8
+#defineTCF_EM_IPT  9
+#defineTCF_EM_MAX  9
 
 enum {
TCF_EM_PROG_TC
diff --git a/include/uapi/linux/tc_ematch/tc_em_ipt.h 
b/include/uapi/linux/tc_ematch/tc_em_ipt.h
new file mode 100644
index 000..49a6553
--- /dev/null
+++ b/include/uapi/linux/tc_ematch/tc_em_ipt.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_TC_EM_IPT_H
+#define __LINUX_TC_EM_IPT_H
+
+#include 
+#include 
+
+enum {
+   TCA_EM_IPT_UNSPEC,
+   TCA_EM_IPT_HOOK,
+   TCA_EM_IPT_MATCH_NAME,
+   TCA_EM_IPT_MATCH_REVISION,
+   TCA_EM_IPT_NFPROTO,
+   TCA_EM_IPT_MATCH_DATA,
+   __TCA_EM_IPT_MAX
+};
+
+#define TCA_EM_IPT_MAX (__TCA_EM_IPT_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index f24a6ae..a01169f 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -658,6 +658,18 @@ config NET_EMATCH_IPSET
  To compile this code as a module, choose M here: the
  module will be called em_ipset.
 
+config NET_EMATCH_IPT
+   tristate "IPtables Matches"
+   depends on NET_EMATCH && NETFILTER && NETFILTER_XTABLES
+   ---help---
+ Say Y here to be able to classify packets based on iptables
+ matches.
+ Current supported match is "policy" which allows packet classification
+ based on IPsec policy that was used during decapsulation
+
+ To compile this code as a module, choose M here: the
+ module will be called em_ipt.
+
 config NET_CLS_ACT
bool "Actions"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5b63544..8811d38 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -75,3 +75,4 @@ obj-$(CONFIG_NET_EMATCH_META) += em_meta.o
 obj-$(CONFIG_NET_EMATCH_TEXT)  += em_text.o
 obj-$(CONFIG_NET_EMATCH_CANID) += em_canid.o
 obj-$(CONFIG_NET_EMATCH_IPSET) += em_ipset.o
+obj-$(CONFIG_NET_EMATCH_IPT)   += em_ipt.o
diff --git a/net/sched/em_ipt.c b/net/sched/em_ipt.c
new file mode 100644
index 000..a5f34e9
--- /dev/null
+++ b/net/sched/em_ipt.c
@@ -0,0 +1,257 @@
+/*
+ * net/sched/em_ipt.c IPtables matches Ematch
+ *
+ * (c) 2018 Eyal Birger 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct em_ipt_match {
+   const struct xt_match *match;
+   u32 hook;
+   u8 match_data[0] __aligned(8);
+};
+
+struct em_ipt_xt_match {
+   char *match_name;
+   int (*validate_match_data)(struct nlattr **tb, u8 mrev);
+};
+
+static const struct nla_policy em_ipt_policy[TCA_EM_IPT_MAX + 1] = {
+   [TCA_EM_IPT_MATCH_NAME] = { .type = NLA_STRING,
+   .len = XT_EXTENSION_MAXNAMELEN },
+   [TCA_EM_IPT_MATCH_REVISION] = { .type = NLA_U8 },
+   [TCA_EM_IPT_HOOK]   = { .type = NLA_U32 },
+   [TCA_EM_IPT_NFPROTO]= { .type = NLA_U8 },
+   [TCA_EM_IPT_MATCH_DATA] = { .type = NLA_UNSPEC },
+};
+
+

Re: [PATCH net-next 1/2] net: netfilter: export xt_policy match_policy_in() as xt_policy_match_policy_in()

2018-02-14 Thread Eyal Birger
Hi Pablo,

On Mon, 15 Jan 2018 13:48:41 +0200
Eyal Birger  wrote:

> On Mon, Jan 15, 2018 at 12:57 PM, Pablo Neira Ayuso
>  wrote:
> > On Sun, Jan 14, 2018 at 02:47:46PM +0200, Eyal Birger wrote:  
> >> On Fri, Jan 12, 2018 at 4:00 PM, Pablo Neira Ayuso
> >>  wrote:  
> >> > On Fri, Jan 12, 2018 at 03:56:21PM +0200, Eyal Birger wrote:  
> >> >> On Fri, Jan 12, 2018 at 3:41 PM, Pablo Neira Ayuso
> >> >>  wrote:  
> >> >> > On Fri, Jan 12, 2018 at 02:57:24PM +0200, Eyal Birger wrote:  
> >> >> >> @@ -51,9 +52,9 @@ match_xfrm_state(const struct xfrm_state
> >> >> >> *x, const struct xt_policy_elem *e, MATCH(reqid,
> >> >> >> x->props.reqid); }
> >> >> >>
> >> >> >> -static int
> >> >> >> -match_policy_in(const struct sk_buff *skb, const struct
> >> >> >> xt_policy_info *info,
> >> >> >> - unsigned short family)
> >> >> >> +int xt_policy_match_policy_in(const struct sk_buff *skb,
> >> >> >> +   const struct xt_policy_info
> >> >> >> *info,
> >> >> >> +   unsigned short family)
> >> >> >>  {
> >> >> >>   const struct xt_policy_elem *e;
> >> >> >>   const struct sec_path *sp = skb->sp;
> >> >> >> @@ -80,10 +81,11 @@ match_policy_in(const struct sk_buff
> >> >> >> *skb, const struct xt_policy_info *info,
> >> >> >>
> >> >> >>   return strict ? 1 : 0;
> >> >> >>  }
> >> >> >> +EXPORT_SYMBOL_GPL(xt_policy_match_policy_in);  
> >> >> >
> >> >> > If you just want to call xt_policy_match from tc, then you
> >> >> > could use tc ipt infrastructure instead.  
> >> >>
> >> >> Thanks for the suggestion -
> >> >> Are you referring to act_ipt? it looks like it allows calling
> >> >> targets; I couldn't find a classifier calling a netfilter
> >> >> matcher.  
> >> >
> >> > Then, I'd suggest you extend that infrastructure to alllow to
> >> > call matches, so we reduce the number of interdepencies between
> >> > different subsystems.  
> >>
> >> This appears very versatile. though in this case the use of the
> >> xtables code and structures was done in order to avoid introducing
> >> new uapi structures and supporting
> >> match code, not necessarily to expose the full capabilities of
> >> extended matches, similar in spirit to what was done in the
> >> em_ipset ematch.
> >>
> >> Perhaps in order to avoid the direct export of xt_policy code, I
> >> could call xt_request_find_match() from the em_policy module,
> >> requesting the xt_policy match?
> >> this way api exposure is minimized while not overly complicating
> >> the scope of this feature.
> >>
> >> What do you think?  
> >
> > That would look better indeed.
> >
> > But once you call xt_request_find_match() from there, how far is to
> > allow any arbitrary match? I think you only have to specify the
> > match name, family and the binary layout structure that represents
> > xt_policy, right?
> >  
> 
> I don't think that should be a problem. I'd need to pass the protocol
> onto the ematches .change() callbacks and get the appropriate match
> from there.
> 
> > I'm telling this, because I think it would be fair enough to me if
> > you add the generic infrastructure to the kernel to allow arbitrary
> > load of xt matches, and then from userspace you just add the code to
> > support this which is what you need.
> >
> > Probably someone else - not you - may follow up later on to
> > generalize the userspace codebase to support other matches, by when
> > that happens, the right bits will be in the kernel already.  
> 
> I'm fine with submitting the more generic infrastructure.
> Will follow up with a new series.

Following up on this thread, I think this feature would better be
implemented utilizing xt_policy from tc instead of supporting arbitrary
xt matches.

Feedback on the generic framework ([1], [2]) revolved around the ability
to create the skb environment for running matches accessing the
skb->data.

My concern is that it would be difficult to maintain the correct
environment for any xt match, whereas it is simple to create a
designated ematch for a specific xt match - as done for ipset - which
can validate the necessary prerequisites for that xt match.

It is also simple to dynamically fetch the xt_policy match function
using xt_request_find_match() as suggested in the em_ipt submittion.

I'd very much appreciate your feedback.

Thanks,
Eyal.

[1] https://patchwork.ozlabs.org/patch/864683/
[2] https://patchwork.ozlabs.org/patch/866490/


Re: [PATCH ipsec,v3] net: xfrm_policy: fix device unregistration hang

2018-02-13 Thread Eyal Birger
Hi Xin Long,

On Tue, 13 Feb 2018 23:18:14 +0800
Xin Long  wrote:

> On Tue, Feb 13, 2018 at 6:54 PM, Eyal Birger 
> wrote:
> > In setups like the following:
> >
> >Host A  --Host B
> > tun0 -- ipsec -- eth0  --  eth0 -- ipsec -- tun0
> >
> > where tun0 are tunnel devices using dst_cache (ipip, ipip6, etc...).
> >
> > Unregistration of an underlying eth0 device leads to the following
> > log messages:
> >
> > unregister_netdevice: waiting for eth0 to become free. Usage count
> > = 2
> >
> > This is since xfrm dsts device references are not released upon
> > device unregistration when the xfrm dst is cached in a dst_cache.
> >
> > This issue was first introduced in commit 52df157f17e5
> > ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle")
> > as part of an effort to remove routing garbage collection.
> >
> > Several approaches for fixing this were discussed in [1]; this
> > commit keeps track of allocated xdsts and releases their device
> > references on a netdev unregister/down events.
> >
> > Signed-off-by: Eyal Birger 
> > Fixes: 52df157f17e5 ("xfrm: take refcnt of dst when creating struct
> > xfrm_dst bundle")  
> I had another fix [1] for this issue in my local tree.
> That reuses uncached_list to track xdsts, and fewer job to do for
> this. As naturally uncached_list is already the list that is supposed
> to do this job. We don't have to introduce a new list and members.
> 
> [1] https://paste.fedoraproject.org/paste/X8FQzEXjGXQBTKAVLnDn4w

I have tested your patch and it resolves the issue described.
I like not having to introduce a new list and lock.

Small note: I think you need to increase net->ipv6.rt6_stats->fib_rt_uncache
when adding to the IPv6 uncached list from the xfrm code.

> 
> Not sure if it makes more sense for xfrm?

I wasn't sure whether xfrm may use the uncached list.
But it looks correct to me.

Eyal.


[PATCH ipsec,v3] net: xfrm_policy: fix device unregistration hang

2018-02-13 Thread Eyal Birger
In setups like the following:

   Host A  --Host B
tun0 -- ipsec -- eth0  --  eth0 -- ipsec -- tun0

where tun0 are tunnel devices using dst_cache (ipip, ipip6, etc...).

Unregistration of an underlying eth0 device leads to the following log
messages:

unregister_netdevice: waiting for eth0 to become free. Usage count = 2

This is since xfrm dsts device references are not released upon device
unregistration when the xfrm dst is cached in a dst_cache.

This issue was first introduced in commit 52df157f17e5
("xfrm: take refcnt of dst when creating struct xfrm_dst bundle")
as part of an effort to remove routing garbage collection.

Several approaches for fixing this were discussed in [1]; this commit keeps
track of allocated xdsts and releases their device references on a netdev
unregister/down events.

Signed-off-by: Eyal Birger 
Fixes: 52df157f17e5 ("xfrm: take refcnt of dst when creating struct xfrm_dst 
bundle")

[1] https://patchwork.ozlabs.org/patch/869025/

---

v3:
  - add 'Fixes' tag and document appropriately
v2:
  - call gc flush from existing netdev notifier per Shannon Nelson's
suggestion.
---
 include/net/xfrm.h | 11 +++--
 net/xfrm/xfrm_device.c |  2 ++
 net/xfrm/xfrm_policy.c | 66 ++
 3 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 7d20776..4614047 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -324,6 +324,7 @@ void xfrm_policy_unregister_afinfo(const struct 
xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
 void xfrm_policy_cache_flush(void);
+void xfrm_policy_dst_gc_flush_dev(struct net_device *dev);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
@@ -994,6 +995,8 @@ struct xfrm_dst {
u32 child_mtu_cached;
u32 route_cookie;
u32 path_cookie;
+   struct list_head xfrm_dst_gc;
+   struct xfrm_dst_list *xfrm_dst_gc_list;
 };
 
 static inline struct dst_entry *xfrm_dst_path(const struct dst_entry *dst)
@@ -1025,13 +1028,7 @@ static inline void xfrm_dst_set_child(struct xfrm_dst 
*xdst, struct dst_entry *c
xdst->child = child;
 }
 
-static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
-{
-   xfrm_pols_put(xdst->pols, xdst->num_pols);
-   dst_release(xdst->route);
-   if (likely(xdst->u.dst.xfrm))
-   xfrm_state_put(xdst->u.dst.xfrm);
-}
+void xfrm_dst_destroy(struct xfrm_dst *xdst);
 #endif
 
 void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev);
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 8e70291..5bd9892 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -309,6 +309,7 @@ static int xfrm_dev_register(struct net_device *dev)
 static int xfrm_dev_unregister(struct net_device *dev)
 {
xfrm_policy_cache_flush();
+   xfrm_policy_dst_gc_flush_dev(dev);
return NOTIFY_DONE;
 }
 
@@ -323,6 +324,7 @@ static int xfrm_dev_down(struct net_device *dev)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
xfrm_policy_cache_flush();
+   xfrm_policy_dst_gc_flush_dev(dev);
return NOTIFY_DONE;
 }
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7a23078..c7b9a9a 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,6 +45,12 @@ struct xfrm_flo {
u8 flags;
 };
 
+struct xfrm_dst_list {
+   spinlock_t lock; /* sync writers */
+   struct list_head head;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct xfrm_dst_list, xfrm_dst_gc_list);
 static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
 static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
@@ -94,6 +100,51 @@ __xfrm6_selector_match(const struct xfrm_selector *sel, 
const struct flowi *fl)
(fl6->flowi6_oif == sel->ifindex || !sel->ifindex);
 }
 
+static void xfrm_dst_add_to_gc_list(struct xfrm_dst *xdst)
+{
+   struct xfrm_dst_list *xl = raw_cpu_ptr(&xfrm_dst_gc_list);
+
+   xdst->xfrm_dst_gc_list = xl;
+
+   spin_lock_bh(&xl->lock);
+   list_add_tail(&xdst->xfrm_dst_gc, &xl->head);
+   spin_unlock_bh(&xl->lock);
+}
+
+void xfrm_dst_destroy(struct xfrm_dst *xdst)
+{
+   xfrm_pols_put(xdst->pols, xdst->num_pols);
+   dst_release(xdst->route);
+   if (likely(xdst->u.dst.xfrm))
+   xfrm_state_put(xdst->u.dst.xfrm);
+   if (!list_empty(&xdst->xfrm_dst_gc)) {
+   struct xfrm_dst_list *xl = xdst->xfrm_dst_gc_list;
+
+   spin_lock_bh(&xl->lock);
+   list_del(&xdst->xfrm_dst_gc);
+   spin_unlock_bh(&xl->lock);
+   }
+}
+EXPORT_SYMBOL_GPL(xfrm_dst_destroy);
+
+v

[PATCH ipsec,v2] net: xfrm_policy: fix device unregistration hang

2018-02-12 Thread Eyal Birger
In setups like the following:

   Host A  --Host B
tun0 -- ipsec -- eth0  --  eth0 -- ipsec -- tun0

where tun0 are tunnel devices using dst_cache (ipip, ipip6, etc...).

Unregistration of an underlying eth0 device leads to the following log
messages:

unregister_netdevice: waiting for eth0 to become free. Usage count = 2

This is since xfrm dsts device references are not released upon device
unregistration when the xfrm dst is cached in a dst_cache.

Several approaches for fixing this were discussed in [1]; this commit keeps
track of allocated xdsts and releases their device references on a netdev
unregister event.

Signed-off-by: Eyal Birger 

[1] https://patchwork.ozlabs.org/patch/869025/

---

v2:
  - call gc flush from existing netdev notifier per Shannon Nelson's
suggestion.
---
 include/net/xfrm.h | 11 +++--
 net/xfrm/xfrm_device.c |  2 ++
 net/xfrm/xfrm_policy.c | 66 ++
 3 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 7d20776..4614047 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -324,6 +324,7 @@ void xfrm_policy_unregister_afinfo(const struct 
xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
 void xfrm_policy_cache_flush(void);
+void xfrm_policy_dst_gc_flush_dev(struct net_device *dev);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
@@ -994,6 +995,8 @@ struct xfrm_dst {
u32 child_mtu_cached;
u32 route_cookie;
u32 path_cookie;
+   struct list_head xfrm_dst_gc;
+   struct xfrm_dst_list *xfrm_dst_gc_list;
 };
 
 static inline struct dst_entry *xfrm_dst_path(const struct dst_entry *dst)
@@ -1025,13 +1028,7 @@ static inline void xfrm_dst_set_child(struct xfrm_dst 
*xdst, struct dst_entry *c
xdst->child = child;
 }
 
-static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
-{
-   xfrm_pols_put(xdst->pols, xdst->num_pols);
-   dst_release(xdst->route);
-   if (likely(xdst->u.dst.xfrm))
-   xfrm_state_put(xdst->u.dst.xfrm);
-}
+void xfrm_dst_destroy(struct xfrm_dst *xdst);
 #endif
 
 void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev);
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 8e70291..5bd9892 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -309,6 +309,7 @@ static int xfrm_dev_register(struct net_device *dev)
 static int xfrm_dev_unregister(struct net_device *dev)
 {
xfrm_policy_cache_flush();
+   xfrm_policy_dst_gc_flush_dev(dev);
return NOTIFY_DONE;
 }
 
@@ -323,6 +324,7 @@ static int xfrm_dev_down(struct net_device *dev)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
xfrm_policy_cache_flush();
+   xfrm_policy_dst_gc_flush_dev(dev);
return NOTIFY_DONE;
 }
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7a23078..c7b9a9a 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,6 +45,12 @@ struct xfrm_flo {
u8 flags;
 };
 
+struct xfrm_dst_list {
+   spinlock_t lock; /* sync writers */
+   struct list_head head;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct xfrm_dst_list, xfrm_dst_gc_list);
 static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
 static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
@@ -94,6 +100,51 @@ __xfrm6_selector_match(const struct xfrm_selector *sel, 
const struct flowi *fl)
(fl6->flowi6_oif == sel->ifindex || !sel->ifindex);
 }
 
+static void xfrm_dst_add_to_gc_list(struct xfrm_dst *xdst)
+{
+   struct xfrm_dst_list *xl = raw_cpu_ptr(&xfrm_dst_gc_list);
+
+   xdst->xfrm_dst_gc_list = xl;
+
+   spin_lock_bh(&xl->lock);
+   list_add_tail(&xdst->xfrm_dst_gc, &xl->head);
+   spin_unlock_bh(&xl->lock);
+}
+
+void xfrm_dst_destroy(struct xfrm_dst *xdst)
+{
+   xfrm_pols_put(xdst->pols, xdst->num_pols);
+   dst_release(xdst->route);
+   if (likely(xdst->u.dst.xfrm))
+   xfrm_state_put(xdst->u.dst.xfrm);
+   if (!list_empty(&xdst->xfrm_dst_gc)) {
+   struct xfrm_dst_list *xl = xdst->xfrm_dst_gc_list;
+
+   spin_lock_bh(&xl->lock);
+   list_del(&xdst->xfrm_dst_gc);
+   spin_unlock_bh(&xl->lock);
+   }
+}
+EXPORT_SYMBOL_GPL(xfrm_dst_destroy);
+
+void xfrm_policy_dst_gc_flush_dev(struct net_device *dev)
+{
+   struct xfrm_dst *xdst;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   struct xfrm_dst_list *xl = &per_cpu(xfrm_dst_gc_list, cpu);
+
+   spin_lock_bh(&xl->lock);
+   list_for_each_entry(xdst, &xl->head, xfrm_dst_gc) {
+   

Re: [PATCH ipsec] net: xfrm_policy: fix device unregistration hang

2018-02-12 Thread Eyal Birger
On Mon, 12 Feb 2018 09:55:48 -0800
Shannon Nelson  wrote:

> On 2/12/2018 9:21 AM, Eyal Birger wrote:
> > In setups like the following:
> > 
> > Host A  --Host B
> > tun0 -- ipsec -- eth0  --  eth0 -- ipsec -- tun0
> > 
> > where tun0 are tunnel devices using dst_cache (ipip, ipip6, etc...).
> > 
> > Unregistration of an underlying eth0 device leads to the following
> > log messages:
> > 
> > unregister_netdevice: waiting for eth0 to become free. Usage count
> > = 2
> > 
> > This is since xfrm dsts device references are not released upon
> > device unregistration when the xfrm dst is cached in a dst_cache.
> > 
> > Several approaches for fixing this were discussed in [1]; this
> > commit keeps track of allocated xdsts and releases their device
> > references on a netdev unregister event.
> > 
> > Signed-off-by: Eyal Birger 
> > 
> > [1] https://patchwork.ozlabs.org/patch/869025/
> > ---
> >   include/net/xfrm.h | 10 ++-
> >   net/xfrm/xfrm_policy.c | 81
> > ++ 2 files changed,
> > 84 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> > index 7d20776..c1c8493 100644
> > --- a/include/net/xfrm.h
> > +++ b/include/net/xfrm.h
> > @@ -994,6 +994,8 @@ struct xfrm_dst {
> > u32 child_mtu_cached;
> > u32 route_cookie;
> > u32 path_cookie;
> > +   struct list_head xfrm_dst_gc;
> > +   struct xfrm_dst_list *xfrm_dst_gc_list;
> >   };
> >   
> >   static inline struct dst_entry *xfrm_dst_path(const struct
> > dst_entry *dst) @@ -1025,13 +1027,7 @@ static inline void
> > xfrm_dst_set_child(struct xfrm_dst *xdst, struct dst_entry *c
> > xdst->child = child; }
> >   
> > -static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
> > -{
> > -   xfrm_pols_put(xdst->pols, xdst->num_pols);
> > -   dst_release(xdst->route);
> > -   if (likely(xdst->u.dst.xfrm))
> > -   xfrm_state_put(xdst->u.dst.xfrm);
> > -}
> > +void xfrm_dst_destroy(struct xfrm_dst *xdst);
> >   #endif
> >   
> >   void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device
> > *dev); diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> > index 7a23078..1da8a65 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -45,6 +45,12 @@ struct xfrm_flo {
> > u8 flags;
> >   };
> >   
> > +struct xfrm_dst_list {
> > +   spinlock_t lock; /* sync writers */
> > +   struct list_head head;
> > +};
> > +
> > +static DEFINE_PER_CPU_ALIGNED(struct xfrm_dst_list,
> > xfrm_dst_gc_list); static DEFINE_PER_CPU(struct xfrm_dst *,
> > xfrm_last_dst); static struct work_struct *xfrm_pcpu_work
> > __read_mostly; static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
> > @@ -94,6 +100,51 @@ __xfrm6_selector_match(const struct
> > xfrm_selector *sel, const struct flowi *fl) (fl6->flowi6_oif ==
> > sel->ifindex || !sel->ifindex); }
> >   
> > +static void xfrm_dst_add_to_gc_list(struct xfrm_dst *xdst)
> > +{
> > +   struct xfrm_dst_list *xl = raw_cpu_ptr(&xfrm_dst_gc_list);
> > +
> > +   xdst->xfrm_dst_gc_list = xl;
> > +
> > +   spin_lock_bh(&xl->lock);
> > +   list_add_tail(&xdst->xfrm_dst_gc, &xl->head);
> > +   spin_unlock_bh(&xl->lock);
> > +}
> > +
> > +void xfrm_dst_destroy(struct xfrm_dst *xdst)
> > +{
> > +   xfrm_pols_put(xdst->pols, xdst->num_pols);
> > +   dst_release(xdst->route);
> > +   if (likely(xdst->u.dst.xfrm))
> > +   xfrm_state_put(xdst->u.dst.xfrm);
> > +   if (!list_empty(&xdst->xfrm_dst_gc)) {
> > +   struct xfrm_dst_list *xl = xdst->xfrm_dst_gc_list;
> > +
> > +   spin_lock_bh(&xl->lock);
> > +   list_del(&xdst->xfrm_dst_gc);
> > +   spin_unlock_bh(&xl->lock);
> > +   }
> > +}
> > +EXPORT_SYMBOL_GPL(xfrm_dst_destroy);
> > +
> > +static void xfrm_flush_dev(struct net_device *dev)
> > +{
> > +   struct xfrm_dst *xdst;
> > +   int cpu;
> > +
> > +   for_each_possible_cpu(cpu) {
> > +   struct xfrm_dst_list *xl =
> > &per_cpu(xfrm_dst_gc_list, cpu); +
> > +   spin_lock_bh(&xl->lock);
> > +   list_for_each_entry(xdst, &xl->head, xfrm_dst_gc) {
> > +   if (xdst->u.dst

[PATCH ipsec] net: xfrm_policy: fix device unregistration hang

2018-02-12 Thread Eyal Birger
In setups like the following:

   Host A  --Host B
tun0 -- ipsec -- eth0  --  eth0 -- ipsec -- tun0

where tun0 are tunnel devices using dst_cache (ipip, ipip6, etc...).

Unregistration of an underlying eth0 device leads to the following log
messages:

unregister_netdevice: waiting for eth0 to become free. Usage count = 2

This is since xfrm dsts device references are not released upon device
unregistration when the xfrm dst is cached in a dst_cache.

Several approaches for fixing this were discussed in [1]; this commit keeps
track of allocated xdsts and releases their device references on a netdev
unregister event.

Signed-off-by: Eyal Birger 

[1] https://patchwork.ozlabs.org/patch/869025/
---
 include/net/xfrm.h | 10 ++-
 net/xfrm/xfrm_policy.c | 81 ++
 2 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 7d20776..c1c8493 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -994,6 +994,8 @@ struct xfrm_dst {
u32 child_mtu_cached;
u32 route_cookie;
u32 path_cookie;
+   struct list_head xfrm_dst_gc;
+   struct xfrm_dst_list *xfrm_dst_gc_list;
 };
 
 static inline struct dst_entry *xfrm_dst_path(const struct dst_entry *dst)
@@ -1025,13 +1027,7 @@ static inline void xfrm_dst_set_child(struct xfrm_dst 
*xdst, struct dst_entry *c
xdst->child = child;
 }
 
-static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
-{
-   xfrm_pols_put(xdst->pols, xdst->num_pols);
-   dst_release(xdst->route);
-   if (likely(xdst->u.dst.xfrm))
-   xfrm_state_put(xdst->u.dst.xfrm);
-}
+void xfrm_dst_destroy(struct xfrm_dst *xdst);
 #endif
 
 void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7a23078..1da8a65 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,6 +45,12 @@ struct xfrm_flo {
u8 flags;
 };
 
+struct xfrm_dst_list {
+   spinlock_t lock; /* sync writers */
+   struct list_head head;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct xfrm_dst_list, xfrm_dst_gc_list);
 static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
 static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
@@ -94,6 +100,51 @@ __xfrm6_selector_match(const struct xfrm_selector *sel, 
const struct flowi *fl)
(fl6->flowi6_oif == sel->ifindex || !sel->ifindex);
 }
 
+static void xfrm_dst_add_to_gc_list(struct xfrm_dst *xdst)
+{
+   struct xfrm_dst_list *xl = raw_cpu_ptr(&xfrm_dst_gc_list);
+
+   xdst->xfrm_dst_gc_list = xl;
+
+   spin_lock_bh(&xl->lock);
+   list_add_tail(&xdst->xfrm_dst_gc, &xl->head);
+   spin_unlock_bh(&xl->lock);
+}
+
+void xfrm_dst_destroy(struct xfrm_dst *xdst)
+{
+   xfrm_pols_put(xdst->pols, xdst->num_pols);
+   dst_release(xdst->route);
+   if (likely(xdst->u.dst.xfrm))
+   xfrm_state_put(xdst->u.dst.xfrm);
+   if (!list_empty(&xdst->xfrm_dst_gc)) {
+   struct xfrm_dst_list *xl = xdst->xfrm_dst_gc_list;
+
+   spin_lock_bh(&xl->lock);
+   list_del(&xdst->xfrm_dst_gc);
+   spin_unlock_bh(&xl->lock);
+   }
+}
+EXPORT_SYMBOL_GPL(xfrm_dst_destroy);
+
+static void xfrm_flush_dev(struct net_device *dev)
+{
+   struct xfrm_dst *xdst;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   struct xfrm_dst_list *xl = &per_cpu(xfrm_dst_gc_list, cpu);
+
+   spin_lock_bh(&xl->lock);
+   list_for_each_entry(xdst, &xl->head, xfrm_dst_gc) {
+   if (xdst->u.dst.dev != dev)
+   continue;
+   dst_dev_put(&xdst->u.dst);
+   }
+   spin_unlock_bh(&xl->lock);
+   }
+}
+
 bool xfrm_selector_match(const struct xfrm_selector *sel, const struct flowi 
*fl,
 unsigned short family)
 {
@@ -1581,6 +1632,8 @@ static struct dst_entry *xfrm_bundle_create(struct 
xfrm_policy *policy,
}
 
bundle[i] = xdst;
+   xfrm_dst_add_to_gc_list(xdst);
+
if (!xdst_prev)
xdst0 = xdst;
else
@@ -2937,6 +2990,20 @@ static void xfrm_policy_fini(struct net *net)
xfrm_hash_free(net->xfrm.policy_byidx, sz);
 }
 
+static int xfrm_netdev_event(struct notifier_block *this, unsigned long event,
+void *ptr)
+{
+   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+   if (event == NETDEV_UNREGISTER)
+   xfrm_flush_dev(dev);
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block xfrm_netdev_notifier = {
+   .noti

Re: xfrm, ip tunnel: non released device reference upon device unregistration

2018-02-12 Thread Eyal Birger
On Sun, 11 Feb 2018 16:46:48 +0100
Florian Westphal  wrote:

> Eyal Birger  wrote:
> 
> Sorry for taking so long to respond.
> 
> > On Tue, 6 Feb 2018 14:15:09 +0100
> > Florian Westphal  wrote:
> >   
> > > Steffen Klassert  wrote:  
> > > > I gave the patch a quick try, but still I get this:
> > > > 
> > > > unregister_netdevice: waiting for dummy1 to become free. Usage
> > > > count = 2
> > > 
> > > Was that with Eyals setup or the bridge one I posted?
> > > 
> > > If it was Eyals setup, its possible the patch missed hookup
> > > to whatever tunnel infra is used (the setup I used has ipip
> > > tunnel, everything is ipv4).
> > >   
> > 
> > Thanks!
> > 
> > Indeed the setup I'm testing uses ip6_tunnel.
> > I have tested a fix in the spirit of the patch and it looks valid 
> > for ip6_tunnel as well.
> >
> > It looks though that this change would need to be added to any
> > tunnel device using dst_cache (vxlan, geneve, gre, ...).  
> 
> Yes.  Meanwhile I tested your patch and it works for me too.
> As your patch is shorter and ipv4/ipv6 seem to take care of refcount
> put just fine I think your patch is the right way to go.
> 
> The xfrm_dst size incrase isn't much of a big deal, there is ample of
> padding at the end so it will still be allocated from same slab.
> 
> We could reduce num_pols and num_xfrms to u8, which creates a 16 bit
> hole, then store the cpu number instead of a list pointer.
> 
> This would limit growth to 16 instead of 24.
> 
> But, as I said, i do not think its a big deal.
> 
> > I'm wondering - non-xfrm dsts are already correctly invalidated,
> > so do you think it makes sense to invalidate caches for devices that
> > have no xfrm dsts? or maybe I didn't understand the suggestion?  
> 
> See above, I think your patch is the way to go.

Ok, thanks. Will submit a formal patch.

Eyal.



Re: xfrm, ip tunnel: non released device reference upon device unregistration

2018-02-06 Thread Eyal Birger
On Tue, 6 Feb 2018 14:15:09 +0100
Florian Westphal  wrote:

> Steffen Klassert  wrote:
> > I gave the patch a quick try, but still I get this:
> > 
> > unregister_netdevice: waiting for dummy1 to become free. Usage
> > count = 2  
> 
> Was that with Eyals setup or the bridge one I posted?
> 
> If it was Eyals setup, its possible the patch missed hookup
> to whatever tunnel infra is used (the setup I used has ipip tunnel,
> everything is ipv4).
> 

Thanks!

Indeed the setup I'm testing uses ip6_tunnel.
I have tested a fix in the spirit of the patch and it looks valid 
for ip6_tunnel as well.

It looks though that this change would need to be added to any tunnel
device using dst_cache (vxlan, geneve, gre, ...).

> Also, perhaps it would be best to not bother with checking the
> device in question at all and unconditionally put device reference
> of all the dst_caches.  With setups that have e.g. 1k devices going
> down per second (ppp dialin and the like) doing the full search for
> every notify event would be rather expensive.
> 

I'm wondering - non-xfrm dsts are already correctly invalidated,
so do you think it makes sense to invalidate caches for devices that
have no xfrm dsts? or maybe I didn't understand the suggestion?

Eyal.


Re: xfrm, ip tunnel: non released device reference upon device unregistration

2018-02-06 Thread Eyal Birger
Hi Steffen,

On Tue, 6 Feb 2018 09:53:38 +0100
Steffen Klassert  wrote:

> Cc Wei Wang
> 
> On Sun, Feb 04, 2018 at 01:21:18PM +0200, Eyal Birger wrote:
> > Hi,
> > 
> > We've encountered a non released device reference upon device
> > unregistration which seems to stem from xfrm policy code.
> > 
> > The setup includes:
> > - an underlay device (e.g. eth0) using IPv4
> > - an xfrm IPv6 over IPv4 tunnel routed via the underlay device
> > - an ipip6 tunnel over the xfrm IPv6 tunnel
> > 
> > When tearing down the underlay device, after traffic had passed via
> > the ipip6 tunnel, log messages of the following form are observed:
> > 
> > unregister_netdevice: waiting for eth0 to become free. Usage count
> > = 2  
> 
> Looks like this happened when the dst garbage collection code was
> removed. I could not point to a commit that introduced it so I
> did a bisection and this pointed to:
> 
> commit 9514528d92d4cbe086499322370155ed69f5d06c
> ipv6: call dst_dev_put() properly
> 
> With this commit we leak the one refcount and some further commit
> leaked the second one.
> 
> > 
> > The below synthetic script reproduces this consistently on a fresh
> > ubuntu vm running net-next v4.15-6066-ge9522a5:
> > -
> > #!/bin/bash
> > 
> > ipsec_underlay_dst=192.168.6.1
> > ipsec_underlay_src=192.168.5.2
> > ipv6_pfx=1234
> > local_ipv6_addr="$ipv6_pfx::1"
> > remote_ipv6_addr="$ipv6_pfx::2"
> > 
> > # create dummy ipsec underlay
> > ip l add dev dummy1 type dummy
> > ip l set dev dummy1 up
> > ip r add "$ipsec_underlay_dst/32" dev dummy1
> > ip -6 r add "$ipv6_pfx::/16" dev dummy1
> > 
> > ip a add dev dummy1 "$local_ipv6_addr/128"
> > ip a add dev dummy1 "$ipsec_underlay_src/24"
> > 
> > # add xfrm policy and state
> > ip x p add src "$local_ipv6_addr/128" dst "$ipv6_pfx::/16" dir out
> > tmpl src "$ipsec_underlay_src" dst "$ipsec_underlay_dst" proto esp
> > reqid 1 mode tunnel ip x s add src "$ipsec_underlay_src" dst
> > "$ipsec_underlay_dst" proto esp spi 0xcd440ce6 reqid 1 mode tunnel
> > auth-trunc 'hmac(sha1)' 0x34a546d309031628962b814ef073aff1a638ad21
> > 96 enc 'cbc(aes)' 0xf31e14149c328297fe7925ad7448420e encap espinudp
> > 4500 4500 0.0.0.0
> > 
> > # add 4o6 tunnel
> > ip l add tnl46 type ip6tnl mode ipip6 local "$local_ipv6_addr"
> > remote "$remote_ipv6_addr" ip l set dev tnl46 up
> > ip r add 10.64.0.0/10 dev tnl46 
> > 
> > # pass traffic so route is cached
> > ping -w 1 -c 1 10.64.0.1
> > 
> > # remove dummy underlay
> > ip l del dummy1
> > -
> > 
> > Analysis:
> > 
> > ip6_tunnel holds a dst_cache which caches its underlay dst objects.
> > When devices are unregistered, non-xfrm dst objects are invlidated
> > by their original creators (ipv4/ipv6/...) and thus are wiped from
> > dst_cache.
> > 
> > xfrm created routes otoh are not tracked by xfrm, and are not
> > invalidated upon device unregistration, thus hold the device upon
> > unregistration.
> > 
> > The following rough sketch patch illustrates an approach overcoming
> > this issue:
> > -  

[snip]

> > -
> > 
> > This approach has the unfortunate side effects of adding a spin
> > lock for the tracked list, as well as increasing struct xfrm_dst.  
> 
> Reintroducing garbage collection is probably not a so good idea. I
> think the patch below should fix it a bit less intrusive.
> 
> 
> Subject: [PATCH RFC] xfrm: Fix netdev refcount leak when flushing the
> percpu dst cache.
> 
> The dst garbage collection code is removed, so we need to call
> dst_dev_put() on cached dst entries before we release them.
> Otherwise we leak the refcount to the netdev.
> 
> Signed-off-by: Steffen Klassert 
> ---
>  net/xfrm/xfrm_policy.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 7a23078132cf..7836b7601b49 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -1715,8 +1715,10 @@ static int xfrm_expand_policies(const struct
> flowi *fl, u16 family, static void xfrm_last_dst_update(struct
> xfrm_dst *xdst, struct xfrm_dst *ol

Re: xfrm, ip tunnel: non released device reference upon device unregistration

2018-02-04 Thread Eyal Birger
On Sun, 4 Feb 2018 13:21:18 +0200
Eyal Birger  wrote:

> Hi,
> 
> We've encountered a non released device reference upon device
> unregistration which seems to stem from xfrm policy code.
> 
> The setup includes:
> - an underlay device (e.g. eth0) using IPv4
> - an xfrm IPv6 over IPv4 tunnel routed via the underlay device
> - an ipip6 tunnel over the xfrm IPv6 tunnel
> 
> When tearing down the underlay device, after traffic had passed via
> the ipip6 tunnel, log messages of the following form are observed:
> 
> unregister_netdevice: waiting for eth0 to become free. Usage count = 2
> 
> The below synthetic script reproduces this consistently on a fresh
> ubuntu vm running net-next v4.15-6066-ge9522a5:

Minor correction: net-next version v4.15-10442-g617aebe

Eyal.


xfrm, ip tunnel: non released device reference upon device unregistration

2018-02-04 Thread Eyal Birger
Hi,

We've encountered a non released device reference upon device
unregistration which seems to stem from xfrm policy code.

The setup includes:
- an underlay device (e.g. eth0) using IPv4
- an xfrm IPv6 over IPv4 tunnel routed via the underlay device
- an ipip6 tunnel over the xfrm IPv6 tunnel

When tearing down the underlay device, after traffic had passed via the ipip6
tunnel, log messages of the following form are observed:

unregister_netdevice: waiting for eth0 to become free. Usage count = 2

The below synthetic script reproduces this consistently on a fresh ubuntu vm
running net-next v4.15-6066-ge9522a5:
-
#!/bin/bash

ipsec_underlay_dst=192.168.6.1
ipsec_underlay_src=192.168.5.2
ipv6_pfx=1234
local_ipv6_addr="$ipv6_pfx::1"
remote_ipv6_addr="$ipv6_pfx::2"

# create dummy ipsec underlay
ip l add dev dummy1 type dummy
ip l set dev dummy1 up
ip r add "$ipsec_underlay_dst/32" dev dummy1
ip -6 r add "$ipv6_pfx::/16" dev dummy1

ip a add dev dummy1 "$local_ipv6_addr/128"
ip a add dev dummy1 "$ipsec_underlay_src/24"

# add xfrm policy and state
ip x p add src "$local_ipv6_addr/128" dst "$ipv6_pfx::/16" dir out tmpl src 
"$ipsec_underlay_src" dst "$ipsec_underlay_dst" proto esp reqid 1 mode tunnel
ip x s add src "$ipsec_underlay_src" dst "$ipsec_underlay_dst" proto esp spi 
0xcd440ce6 reqid 1 mode tunnel auth-trunc 'hmac(sha1)' 
0x34a546d309031628962b814ef073aff1a638ad21 96 enc 'cbc(aes)' 
0xf31e14149c328297fe7925ad7448420e encap espinudp 4500 4500 0.0.0.0

# add 4o6 tunnel
ip l add tnl46 type ip6tnl mode ipip6 local "$local_ipv6_addr" remote 
"$remote_ipv6_addr"
ip l set dev tnl46 up
ip r add 10.64.0.0/10 dev tnl46 

# pass traffic so route is cached
ping -w 1 -c 1 10.64.0.1

# remove dummy underlay
ip l del dummy1
-

Analysis:

ip6_tunnel holds a dst_cache which caches its underlay dst objects.
When devices are unregistered, non-xfrm dst objects are invlidated by their
original creators (ipv4/ipv6/...) and thus are wiped from dst_cache.

xfrm created routes otoh are not tracked by xfrm, and are not invalidated upon
device unregistration, thus hold the device upon unregistration.

The following rough sketch patch illustrates an approach overcoming this
issue:
-
From e188dc5295e3500bc59e8780049840afa2eb3e24 Mon Sep 17 00:00:00 2001
From: Eyal Birger 
Date: Sun, 4 Feb 2018 13:08:02 +0200
Subject: [PATCH] net: xfrm_policy: invalidate xfrm_dsts on device
 unregistration

---
 include/net/xfrm.h | 10 ++-
 net/xfrm/xfrm_policy.c | 75 ++
 2 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 7d20776..c1c8493 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -994,6 +994,8 @@ struct xfrm_dst {
u32 child_mtu_cached;
u32 route_cookie;
u32 path_cookie;
+   struct list_head xfrm_dst_gc;
+   struct xfrm_dst_list *xfrm_dst_gc_list;
 };
 
 static inline struct dst_entry *xfrm_dst_path(const struct dst_entry *dst)
@@ -1025,13 +1027,7 @@ static inline void xfrm_dst_set_child(struct xfrm_dst 
*xdst, struct dst_entry *c
xdst->child = child;
 }
 
-static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
-{
-   xfrm_pols_put(xdst->pols, xdst->num_pols);
-   dst_release(xdst->route);
-   if (likely(xdst->u.dst.xfrm))
-   xfrm_state_put(xdst->u.dst.xfrm);
-}
+void xfrm_dst_destroy(struct xfrm_dst *xdst);
 #endif
 
 void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7a23078..d446641 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -94,6 +94,58 @@ __xfrm6_selector_match(const struct xfrm_selector *sel, 
const struct flowi *fl)
(fl6->flowi6_oif == sel->ifindex || !sel->ifindex);
 }
 
+struct xfrm_dst_list {
+   spinlock_t  lock;
+   struct list_headhead;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct xfrm_dst_list, xfrm_dst_gc_list);
+
+static void xfrm_add_dst_list(struct xfrm_dst *xdst)
+{
+   struct xfrm_dst_list *xl = raw_cpu_ptr(&xfrm_dst_gc_list);
+
+   xdst->xfrm_dst_gc_list = xl;
+
+   spin_lock_bh(&xl->lock);
+   list_add_tail(&xdst->xfrm_dst_gc, &xl->head);
+   spin_unlock_bh(&xl->lock);
+}
+
+void xfrm_dst_destroy(struct xfrm_dst *xdst)
+{
+   xfrm_pols_put(xdst->pols, xdst->num_pols);
+   dst_release(xdst->route);
+   if (likely(xdst->u.dst.xfrm))
+   xfrm_state_put(xdst->u.dst.xfrm);
+   if (!list_empty(&xdst->xfrm_dst_gc)) {
+   

Re: [PATCH net-next,v2 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-30 Thread Eyal Birger
On Sun, 28 Jan 2018 19:22:12 -0800
Cong Wang  wrote:

> On Fri, Jan 26, 2018 at 11:57 AM, Eyal Birger 
> wrote:
> > On Fri, Jan 26, 2018 at 8:50 PM, Pablo Neira Ayuso
> >  wrote:  
> >> Isn't there a way to reject the use of this from ->change()? ie.
> >> from control plane configuration.  
> >
> > I wasn't able to find a simple way of doing so:
> >
> > - AFAIU tc filters are detached from the qdiscs they operate on via
> > tcf_block instances
> >   that may be shared by different qdiscs. I was not able to be sure
> > that filters attached to ingress qdiscs via tcf_blocks at
> > configuration time cannot be later be shared
> >   with non ingress qdiscs. Nor was I able to find another classifier
> > making the ingress/egress
> >   distinction at configuration time.
> >
> > - ematches are not provided with 'ingress/egress' information at
> > 'change()' invocation, though
> >   of course the infrastructure could be extended to provide this,
> > given the distinction is available.
> >  
> 
> In the past you can check tp->q, but now we support shared tc filter
> block, so it is hard. I think your v1 is okay, which just silently
> passes the match on egress side. Or maybe we can just add a pr_info()
> unconditionally in em_ipt_change() saying only ingress is supported.

Thanks!

The motivation for allowing only ingress was to avoid skb modifications
on egress as when running the match on egress, skb->data must point to
the L3 header. Looking again at the calling flow e.g. from __dev_queue_xmit(),
I don't see a case where skb may be shared.

Similarly on ingress flow, sch_handle_ingress() modifies the skb, and
tc actions perform skb modification without share checking.

So as far as I can tell skb_pull() on the match is safe.
Is there a different code path I should be looking for?

If that is the case, perhaps the v1 approach supporting both directions
including skb_pull() can be resubmitted without the pr_notice once
net-next is open.

Eyal.






Re: [PATCH net-next,v2 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-26 Thread Eyal Birger
On Fri, Jan 26, 2018 at 8:50 PM, Pablo Neira Ayuso  wrote:
> On Fri, Jan 26, 2018 at 06:48:53PM +0200, Eyal Birger wrote:
>> diff --git a/net/sched/em_ipt.c b/net/sched/em_ipt.c
>> new file mode 100644
>> index 000..2103b30
>> --- /dev/null
>> +++ b/net/sched/em_ipt.c
> [...]
>> +static int em_ipt_match(struct sk_buff *skb, struct tcf_ematch *em,
>> + struct tcf_pkt_info *info)
>> +{
>> + const struct em_ipt_match *im = (const void *)em->data;
>> + struct xt_action_param acpar = {};
>> + struct net_device *indev = NULL;
>> + struct nf_hook_state state;
>> + int ret;
>> +
>> + if (unlikely(!skb_at_tc_ingress(skb))) {
>> + pr_notice_once("ipt match must not be used at egress\n");
>
> Isn't there a way to reject the use of this from ->change()? ie. from
> control plane configuration.

I wasn't able to find a simple way of doing so:

- AFAIU tc filters are detached from the qdiscs they operate on via
tcf_block instances
  that may be shared by different qdiscs. I was not able to be sure that filters
  attached to ingress qdiscs via tcf_blocks at configuration time
cannot be later be shared
  with non ingress qdiscs. Nor was I able to find another classifier
making the ingress/egress
  distinction at configuration time.

- ematches are not provided with 'ingress/egress' information at
'change()' invocation, though
  of course the infrastructure could be extended to provide this,
given the distinction is available.

Eyal.


[PATCH net-next,v2 1/2] net: sched: ematch: pass protocol to ematch 'change()' handlers

2018-01-26 Thread Eyal Birger
From: Eyal Birger 

In order to allow ematches to create their internal state based on the
L3 protocol specified when creating the filter.

Signed-off-by: Eyal Birger 
---
 include/net/pkt_cls.h | 2 +-
 net/sched/em_canid.c  | 4 ++--
 net/sched/em_ipset.c  | 4 ++--
 net/sched/em_meta.c   | 2 +-
 net/sched/em_nbyte.c  | 4 ++--
 net/sched/em_text.c   | 2 +-
 net/sched/ematch.c| 3 ++-
 7 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 8740625..ff5fcb1 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -474,7 +474,7 @@ struct tcf_ematch_tree {
 struct tcf_ematch_ops {
int kind;
int datalen;
-   int (*change)(struct net *net, void *,
+   int (*change)(struct net *net, __be16, void *,
  int, struct tcf_ematch *);
int (*match)(struct sk_buff *, struct tcf_ematch *,
 struct tcf_pkt_info *);
diff --git a/net/sched/em_canid.c b/net/sched/em_canid.c
index ddd883c..445c10d 100644
--- a/net/sched/em_canid.c
+++ b/net/sched/em_canid.c
@@ -120,8 +120,8 @@ static int em_canid_match(struct sk_buff *skb, struct 
tcf_ematch *m,
return match;
 }
 
-static int em_canid_change(struct net *net, void *data, int len,
- struct tcf_ematch *m)
+static int em_canid_change(struct net *net, __be16 protocol, void *data,
+  int len, struct tcf_ematch *m)
 {
struct can_filter *conf = data; /* Array with rules */
struct canid_match *cm;
diff --git a/net/sched/em_ipset.c b/net/sched/em_ipset.c
index c1b23e3..50f7282 100644
--- a/net/sched/em_ipset.c
+++ b/net/sched/em_ipset.c
@@ -19,8 +19,8 @@
 #include 
 #include 
 
-static int em_ipset_change(struct net *net, void *data, int data_len,
-  struct tcf_ematch *em)
+static int em_ipset_change(struct net *net, __be16 protocol, void *data,
+  int data_len, struct tcf_ematch *em)
 {
struct xt_set_info *set = data;
ip_set_id_t index;
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index d6e9711..6892efc 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -904,7 +904,7 @@ static const struct nla_policy meta_policy[TCA_EM_META_MAX 
+ 1] = {
[TCA_EM_META_HDR]   = { .len = sizeof(struct tcf_meta_hdr) },
 };
 
-static int em_meta_change(struct net *net, void *data, int len,
+static int em_meta_change(struct net *net, __be16 protocol, void *data, int 
len,
  struct tcf_ematch *m)
 {
int err;
diff --git a/net/sched/em_nbyte.c b/net/sched/em_nbyte.c
index 07c10ba..62a7a2e 100644
--- a/net/sched/em_nbyte.c
+++ b/net/sched/em_nbyte.c
@@ -23,8 +23,8 @@ struct nbyte_data {
charpattern[0];
 };
 
-static int em_nbyte_change(struct net *net, void *data, int data_len,
-  struct tcf_ematch *em)
+static int em_nbyte_change(struct net *net, __be16 protocol, void *data,
+  int data_len, struct tcf_ematch *em)
 {
struct tcf_em_nbyte *nbyte = data;
 
diff --git a/net/sched/em_text.c b/net/sched/em_text.c
index 73e2ed5..b5d9e21 100644
--- a/net/sched/em_text.c
+++ b/net/sched/em_text.c
@@ -44,7 +44,7 @@ static int em_text_match(struct sk_buff *skb, struct 
tcf_ematch *m,
return skb_find_text(skb, from, to, tm->config) != UINT_MAX;
 }
 
-static int em_text_change(struct net *net, void *data, int len,
+static int em_text_change(struct net *net, __be16 protocol, void *data, int 
len,
  struct tcf_ematch *m)
 {
struct text_match *tm;
diff --git a/net/sched/ematch.c b/net/sched/ematch.c
index 1331a4c..a69abd8 100644
--- a/net/sched/ematch.c
+++ b/net/sched/ematch.c
@@ -242,7 +242,8 @@ static int tcf_em_validate(struct tcf_proto *tp,
goto errout;
 
if (em->ops->change) {
-   err = em->ops->change(net, data, data_len, em);
+   err = em->ops->change(net, tp->protocol, data, data_len,
+ em);
if (err < 0)
goto errout;
} else if (data_len > 0) {
-- 
2.7.4



[PATCH net-next,v2 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-26 Thread Eyal Birger
From: Eyal Birger 

This module allows performing tc classification based on data structures
and implementations provided by netfilter extensions.

Example use case is classification based on the incoming IPSec policy used
during decpsulation using the 'policy' iptables extension (xt_policy).

Only tc ingress is supported in order to avoid modifying the skb at egress
to suit xtable matches skb->data expectations.

Signed-off-by: Eyal Birger 
Acked-by: Pablo Neira Ayuso 

---

This ematch tries its best to provide matches with a similar
environment to the one they are used to; Due to the wide range of criteria
supported by matches, I am not able to test every combination of match and
traffic.
---
 include/uapi/linux/pkt_cls.h |   3 +-
 include/uapi/linux/tc_ematch/tc_em_ipt.h |  19 +++
 net/sched/Kconfig|  10 ++
 net/sched/Makefile   |   1 +
 net/sched/em_ipt.c   | 244 +++
 5 files changed, 276 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/tc_ematch/tc_em_ipt.h
 create mode 100644 net/sched/em_ipt.c

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 46c5066..7cafb26d 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -555,7 +555,8 @@ enum {
 #defineTCF_EM_VLAN 6
 #defineTCF_EM_CANID7
 #defineTCF_EM_IPSET8
-#defineTCF_EM_MAX  8
+#defineTCF_EM_IPT  9
+#defineTCF_EM_MAX  9
 
 enum {
TCF_EM_PROG_TC
diff --git a/include/uapi/linux/tc_ematch/tc_em_ipt.h 
b/include/uapi/linux/tc_ematch/tc_em_ipt.h
new file mode 100644
index 000..aecd252
--- /dev/null
+++ b/include/uapi/linux/tc_ematch/tc_em_ipt.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_TC_EM_IPT_H
+#define __LINUX_TC_EM_IPT_H
+
+#include 
+#include 
+
+enum {
+   TCA_EM_IPT_UNSPEC,
+   TCA_EM_IPT_HOOK,
+   TCA_EM_IPT_MATCH_NAME,
+   TCA_EM_IPT_MATCH_REVISION,
+   TCA_EM_IPT_MATCH_DATA,
+   __TCA_EM_IPT_MAX
+};
+
+#define TCA_EM_IPT_MAX (__TCA_EM_IPT_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index c03d86a..203e7f7 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -658,6 +658,16 @@ config NET_EMATCH_IPSET
  To compile this code as a module, choose M here: the
  module will be called em_ipset.
 
+config NET_EMATCH_IPT
+   tristate "IPtables Matches"
+   depends on NET_EMATCH && NETFILTER && NETFILTER_XTABLES
+   ---help---
+ Say Y here to be able to classify packets based on iptables
+ matches.
+
+ To compile this code as a module, choose M here: the
+ module will be called em_ipt.
+
 config NET_CLS_ACT
bool "Actions"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5b63544..8811d38 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -75,3 +75,4 @@ obj-$(CONFIG_NET_EMATCH_META) += em_meta.o
 obj-$(CONFIG_NET_EMATCH_TEXT)  += em_text.o
 obj-$(CONFIG_NET_EMATCH_CANID) += em_canid.o
 obj-$(CONFIG_NET_EMATCH_IPSET) += em_ipset.o
+obj-$(CONFIG_NET_EMATCH_IPT)   += em_ipt.o
diff --git a/net/sched/em_ipt.c b/net/sched/em_ipt.c
new file mode 100644
index 000..2103b30
--- /dev/null
+++ b/net/sched/em_ipt.c
@@ -0,0 +1,244 @@
+/*
+ * net/sched/em_ipt.c IPtables matches Ematch
+ *
+ * (c) 2018 Eyal Birger 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct em_ipt_match {
+   const struct xt_match *match;
+   u32 hook;
+   u8 nfproto;
+   u8 match_data[0] __aligned(8);
+};
+
+static int check_match(struct net *net, struct em_ipt_match *im, int mdata_len)
+{
+   struct xt_mtchk_param mtpar = {};
+   union {
+   struct ipt_entry e4;
+   struct ip6t_entry e6;
+   } e = {};
+
+   mtpar.net   = net;
+   mtpar.table = "filter";
+   mtpar.hook_mask = 1 << im->hook;
+   mtpar.family= im->nfproto;
+   mtpar.match = im->match;
+   mtpar.entryinfo = &e;
+   mtpar.matchinfo = (void *)im->match_data;
+   return xt_check_match(&mtpar, mdata_len, 0, 0);
+}
+
+static const struct nla_policy em_ipt_policy[TCA_EM_IPT_MAX + 1] = {
+   [TCA_EM_IPT_HOOK]   = { .type = NLA_U32 },
+   [TCA_EM_IPT_MATCH_NAME] = { .type = NLA_STRING,
+   .len = 

[PATCH net-next,v2 0/2] net: sched: introduce em_ipt ematch

2018-01-26 Thread Eyal Birger
From: Eyal Birger 

The following patchset introduces a new tc ematch for matching using
netfilter matches.

This allows early classification as well as mirroning/redirecting traffic
based on logic implemented in netfilter extensions.

Example use case is classification based on the incoming IPSec policy used
during decpsulation using the 'policy' iptables extension (xt_policy).

This patchset is an enhancement of a former series ([1]) which allowed only
policy matching following a suggestion by Pablo Neira Ayuso ([2]).

[1] https://patchwork.ozlabs.org/cover/859887/
[2] https://patchwork.ozlabs.org/patch/859888/

v2:
  Remove skb push/pull and limit functionality to ingress

Eyal Birger (2):
  net: sched: ematch: pass protocol to ematch 'change()' handlers
  net: sched: add em_ipt ematch for calling xtables matches

 include/net/pkt_cls.h|   2 +-
 include/uapi/linux/pkt_cls.h |   3 +-
 include/uapi/linux/tc_ematch/tc_em_ipt.h |  19 +++
 net/sched/Kconfig|  10 ++
 net/sched/Makefile   |   1 +
 net/sched/em_canid.c |   4 +-
 net/sched/em_ipset.c |   4 +-
 net/sched/em_ipt.c   | 244 +++
 net/sched/em_meta.c  |   2 +-
 net/sched/em_nbyte.c |   4 +-
 net/sched/em_text.c  |   2 +-
 net/sched/ematch.c   |   3 +-
 12 files changed, 287 insertions(+), 11 deletions(-)
 create mode 100644 include/uapi/linux/tc_ematch/tc_em_ipt.h
 create mode 100644 net/sched/em_ipt.c

-- 
2.7.4



Re: [PATCH net-next 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-25 Thread Eyal Birger
On Thu, Jan 25, 2018 at 2:00 AM, Pablo Neira Ayuso  wrote:
> On Wed, Jan 24, 2018 at 04:37:16PM -0500, David Miller wrote:
>> From: Eyal Birger 
>> Date: Tue, 23 Jan 2018 11:17:32 +0200
>>
>> > +   network_offset = skb_network_offset(skb);
>> > +   skb_pull(skb, network_offset);
>> > +
>> > +   rcu_read_lock();
>> > +
>> > +   if (skb->skb_iif)
>> > +   indev = dev_get_by_index_rcu(em->net, skb->skb_iif);
>> > +
>> > +   nf_hook_state_init(&state, im->hook, im->nfproto, indev ?: skb->dev,
>> > +  skb->dev, NULL, em->net, NULL);
>> > +
>> > +   acpar.match = im->match;
>> > +   acpar.matchinfo = im->match_data;
>> > +   acpar.state = &state;
>> > +
>> > +   ret = im->match->match(skb, &acpar);
>> > +
>> > +   rcu_read_unlock();
>> > +
>> > +   skb_push(skb, network_offset);
>>
>> If the SKB is shared in any way, this pull/push around the NF hook
>> invocation is illegal.
>
> At ingress, skb->data points to the network header, which is what the
> xtables matches expect, so these are actually noops, therefore,
> skb_pull() and skb_push() can be removed.

Right. I added those for completeness in supporting the xmit path.
In the xt_policy use-case it is irrelevant as tc is invoked after
encapsulation.

I will submit a v2 without these, asserting the ingress direction.

Note I have followed the example in em_ipset.c, so that might be a
problem there too...

Thanks!
Eyal.


[PATCH net-next 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-23 Thread Eyal Birger
From: Eyal Birger 

This module allows performing tc classification based on data structures
and implementations provided by netfilter extensions.

Example use case is classification based on the incoming IPSec policy used
during decpsulation using the 'policy' iptables extension (xt_policy).

Signed-off-by: Eyal Birger 

---

This ematch tries its best to provide matches with a similar
environment to the one they are used to; Due to the wide range of criteria
supported by matches, I am not able to test every combination of match and
traffic.
---
 include/uapi/linux/pkt_cls.h |   3 +-
 include/uapi/linux/tc_ematch/tc_em_ipt.h |  19 +++
 net/sched/Kconfig|  10 ++
 net/sched/Makefile   |   1 +
 net/sched/em_ipt.c   | 244 +++
 5 files changed, 276 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/tc_ematch/tc_em_ipt.h
 create mode 100644 net/sched/em_ipt.c

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 46c5066..7cafb26d 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -555,7 +555,8 @@ enum {
 #defineTCF_EM_VLAN 6
 #defineTCF_EM_CANID7
 #defineTCF_EM_IPSET8
-#defineTCF_EM_MAX  8
+#defineTCF_EM_IPT  9
+#defineTCF_EM_MAX  9
 
 enum {
TCF_EM_PROG_TC
diff --git a/include/uapi/linux/tc_ematch/tc_em_ipt.h 
b/include/uapi/linux/tc_ematch/tc_em_ipt.h
new file mode 100644
index 000..aecd252
--- /dev/null
+++ b/include/uapi/linux/tc_ematch/tc_em_ipt.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_TC_EM_IPT_H
+#define __LINUX_TC_EM_IPT_H
+
+#include 
+#include 
+
+enum {
+   TCA_EM_IPT_UNSPEC,
+   TCA_EM_IPT_HOOK,
+   TCA_EM_IPT_MATCH_NAME,
+   TCA_EM_IPT_MATCH_REVISION,
+   TCA_EM_IPT_MATCH_DATA,
+   __TCA_EM_IPT_MAX
+};
+
+#define TCA_EM_IPT_MAX (__TCA_EM_IPT_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index c03d86a..203e7f7 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -658,6 +658,16 @@ config NET_EMATCH_IPSET
  To compile this code as a module, choose M here: the
  module will be called em_ipset.
 
+config NET_EMATCH_IPT
+   tristate "IPtables Matches"
+   depends on NET_EMATCH && NETFILTER && NETFILTER_XTABLES
+   ---help---
+ Say Y here to be able to classify packets based on iptables
+ matches.
+
+ To compile this code as a module, choose M here: the
+ module will be called em_ipt.
+
 config NET_CLS_ACT
bool "Actions"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5b63544..8811d38 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -75,3 +75,4 @@ obj-$(CONFIG_NET_EMATCH_META) += em_meta.o
 obj-$(CONFIG_NET_EMATCH_TEXT)  += em_text.o
 obj-$(CONFIG_NET_EMATCH_CANID) += em_canid.o
 obj-$(CONFIG_NET_EMATCH_IPSET) += em_ipset.o
+obj-$(CONFIG_NET_EMATCH_IPT)   += em_ipt.o
diff --git a/net/sched/em_ipt.c b/net/sched/em_ipt.c
new file mode 100644
index 000..9629031
--- /dev/null
+++ b/net/sched/em_ipt.c
@@ -0,0 +1,244 @@
+/*
+ * net/sched/em_ipt.c IPtables matches Ematch
+ *
+ * (c) 2018 Eyal Birger 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct em_ipt_match {
+   const struct xt_match *match;
+   u32 hook;
+   u8 nfproto;
+   u8 match_data[0] __aligned(8);
+};
+
+static int check_match(struct net *net, struct em_ipt_match *im, int mdata_len)
+{
+   struct xt_mtchk_param mtpar = {};
+   union {
+   struct ipt_entry e4;
+   struct ip6t_entry e6;
+   } e = {};
+
+   mtpar.net   = net;
+   mtpar.table = "filter";
+   mtpar.hook_mask = 1 << im->hook;
+   mtpar.family= im->nfproto;
+   mtpar.match = im->match;
+   mtpar.entryinfo = &e;
+   mtpar.matchinfo = (void *)im->match_data;
+   return xt_check_match(&mtpar, mdata_len, 0, 0);
+}
+
+static const struct nla_policy em_ipt_policy[TCA_EM_IPT_MAX + 1] = {
+   [TCA_EM_IPT_HOOK]   = { .type = NLA_U32 },
+   [TCA_EM_IPT_MATCH_NAME] = { .type = NLA_STRING,
+   .len = XT_EXTENSION_MAXNAMELEN },
+   [TCA_EM_IPT_MATCH_REVISION] = { .type = NLA_U8 },
+   [TCA_EM_IPT_MATCH_DATA] = { .type = NLA_UNSP

[PATCH net-next 1/2] net: sched: ematch: pass protocol to ematch 'change()' handlers

2018-01-23 Thread Eyal Birger
From: Eyal Birger 

In order to allow ematches to create their internal state based on the
L3 protocol specified when creating the filter.

Signed-off-by: Eyal Birger 
---
 include/net/pkt_cls.h | 2 +-
 net/sched/em_canid.c  | 4 ++--
 net/sched/em_ipset.c  | 4 ++--
 net/sched/em_meta.c   | 2 +-
 net/sched/em_nbyte.c  | 4 ++--
 net/sched/em_text.c   | 2 +-
 net/sched/ematch.c| 3 ++-
 7 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 2f8f16a..929b117 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -474,7 +474,7 @@ struct tcf_ematch_tree {
 struct tcf_ematch_ops {
int kind;
int datalen;
-   int (*change)(struct net *net, void *,
+   int (*change)(struct net *net, __be16, void *,
  int, struct tcf_ematch *);
int (*match)(struct sk_buff *, struct tcf_ematch *,
 struct tcf_pkt_info *);
diff --git a/net/sched/em_canid.c b/net/sched/em_canid.c
index ddd883c..445c10d 100644
--- a/net/sched/em_canid.c
+++ b/net/sched/em_canid.c
@@ -120,8 +120,8 @@ static int em_canid_match(struct sk_buff *skb, struct 
tcf_ematch *m,
return match;
 }
 
-static int em_canid_change(struct net *net, void *data, int len,
- struct tcf_ematch *m)
+static int em_canid_change(struct net *net, __be16 protocol, void *data,
+  int len, struct tcf_ematch *m)
 {
struct can_filter *conf = data; /* Array with rules */
struct canid_match *cm;
diff --git a/net/sched/em_ipset.c b/net/sched/em_ipset.c
index c1b23e3..50f7282 100644
--- a/net/sched/em_ipset.c
+++ b/net/sched/em_ipset.c
@@ -19,8 +19,8 @@
 #include 
 #include 
 
-static int em_ipset_change(struct net *net, void *data, int data_len,
-  struct tcf_ematch *em)
+static int em_ipset_change(struct net *net, __be16 protocol, void *data,
+  int data_len, struct tcf_ematch *em)
 {
struct xt_set_info *set = data;
ip_set_id_t index;
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index d6e9711..6892efc 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -904,7 +904,7 @@ static const struct nla_policy meta_policy[TCA_EM_META_MAX 
+ 1] = {
[TCA_EM_META_HDR]   = { .len = sizeof(struct tcf_meta_hdr) },
 };
 
-static int em_meta_change(struct net *net, void *data, int len,
+static int em_meta_change(struct net *net, __be16 protocol, void *data, int 
len,
  struct tcf_ematch *m)
 {
int err;
diff --git a/net/sched/em_nbyte.c b/net/sched/em_nbyte.c
index df3110d..1cd5983 100644
--- a/net/sched/em_nbyte.c
+++ b/net/sched/em_nbyte.c
@@ -23,8 +23,8 @@ struct nbyte_data {
charpattern[0];
 };
 
-static int em_nbyte_change(struct net *net, void *data, int data_len,
-  struct tcf_ematch *em)
+static int em_nbyte_change(struct net *net, __be16 protocol, void *data,
+  int data_len, struct tcf_ematch *em)
 {
struct tcf_em_nbyte *nbyte = data;
 
diff --git a/net/sched/em_text.c b/net/sched/em_text.c
index 73e2ed5..b5d9e21 100644
--- a/net/sched/em_text.c
+++ b/net/sched/em_text.c
@@ -44,7 +44,7 @@ static int em_text_match(struct sk_buff *skb, struct 
tcf_ematch *m,
return skb_find_text(skb, from, to, tm->config) != UINT_MAX;
 }
 
-static int em_text_change(struct net *net, void *data, int len,
+static int em_text_change(struct net *net, __be16 protocol, void *data, int 
len,
  struct tcf_ematch *m)
 {
struct text_match *tm;
diff --git a/net/sched/ematch.c b/net/sched/ematch.c
index 1331a4c..a69abd8 100644
--- a/net/sched/ematch.c
+++ b/net/sched/ematch.c
@@ -242,7 +242,8 @@ static int tcf_em_validate(struct tcf_proto *tp,
goto errout;
 
if (em->ops->change) {
-   err = em->ops->change(net, data, data_len, em);
+   err = em->ops->change(net, tp->protocol, data, data_len,
+ em);
if (err < 0)
goto errout;
} else if (data_len > 0) {
-- 
2.7.4



[PATCH net-next 0/2] net: sched: introduce em_ipt ematch

2018-01-23 Thread Eyal Birger
From: Eyal Birger 

The following patchset introduces a new tc ematch for matching using
netfilter matches.

This allows early classification as well as mirroning/redirecting traffic
based on logic implemented in netfilter extensions.

Example use case is classification based on the incoming IPSec policy used
during decpsulation using the 'policy' iptables extension (xt_policy).

This patchset is an enhancement of a former series ([1]) which allowed only
policy matching following a suggestion by Pablo Neira Ayuso ([2]).

[1] https://patchwork.ozlabs.org/cover/859887/
[2] https://patchwork.ozlabs.org/patch/859888/

Eyal Birger (2):
  net: sched: ematch: pass protocol to ematch 'change()' handlers
  net: sched: add em_ipt ematch for calling xtables matches

 include/net/pkt_cls.h|   2 +-
 include/uapi/linux/pkt_cls.h |   3 +-
 include/uapi/linux/tc_ematch/tc_em_ipt.h |  19 +++
 net/sched/Kconfig|  10 ++
 net/sched/Makefile   |   1 +
 net/sched/em_canid.c |   4 +-
 net/sched/em_ipset.c |   4 +-
 net/sched/em_ipt.c   | 244 +++
 net/sched/em_meta.c  |   2 +-
 net/sched/em_nbyte.c |   4 +-
 net/sched/em_text.c  |   2 +-
 net/sched/ematch.c   |   3 +-
 12 files changed, 287 insertions(+), 11 deletions(-)
 create mode 100644 include/uapi/linux/tc_ematch/tc_em_ipt.h
 create mode 100644 net/sched/em_ipt.c

-- 
2.7.4



Re: [PATCH net-next 2/2] net: sched: add xfrm policy ematch

2018-01-16 Thread Eyal Birger
On Tue, Jan 16, 2018 at 8:30 AM, Cong Wang  wrote:
> On Fri, Jan 12, 2018 at 4:57 AM, Eyal Birger  wrote:
>> +static void em_policy_destroy(struct tcf_ematch *em)
>> +{
>> +   const struct xt_policy_info *info = (const void *)em->data;
>> +
>> +   if (!info)
>> +   return;
>> +
>> +   kfree((void *)em->data);
>> +}
>
> Nit: kfree() could handle NULL, no need to check.

Thanks Cong! I later realized I could use the default ematch destructor,
so this function could be removed entirely. However, as I plan to resubmit this
as a more generic ematch without a direct netfilter dependency, this
code will change significantly.

Thanks again,
Eyal.


Re: [PATCH net-next 1/2] net: netfilter: export xt_policy match_policy_in() as xt_policy_match_policy_in()

2018-01-15 Thread Eyal Birger
On Mon, Jan 15, 2018 at 12:57 PM, Pablo Neira Ayuso  wrote:
> On Sun, Jan 14, 2018 at 02:47:46PM +0200, Eyal Birger wrote:
>> On Fri, Jan 12, 2018 at 4:00 PM, Pablo Neira Ayuso  
>> wrote:
>> > On Fri, Jan 12, 2018 at 03:56:21PM +0200, Eyal Birger wrote:
>> >> On Fri, Jan 12, 2018 at 3:41 PM, Pablo Neira Ayuso  
>> >> wrote:
>> >> > On Fri, Jan 12, 2018 at 02:57:24PM +0200, Eyal Birger wrote:
>> >> >> @@ -51,9 +52,9 @@ match_xfrm_state(const struct xfrm_state *x, const 
>> >> >> struct xt_policy_elem *e,
>> >> >>  MATCH(reqid, x->props.reqid);
>> >> >>  }
>> >> >>
>> >> >> -static int
>> >> >> -match_policy_in(const struct sk_buff *skb, const struct 
>> >> >> xt_policy_info *info,
>> >> >> - unsigned short family)
>> >> >> +int xt_policy_match_policy_in(const struct sk_buff *skb,
>> >> >> +   const struct xt_policy_info *info,
>> >> >> +   unsigned short family)
>> >> >>  {
>> >> >>   const struct xt_policy_elem *e;
>> >> >>   const struct sec_path *sp = skb->sp;
>> >> >> @@ -80,10 +81,11 @@ match_policy_in(const struct sk_buff *skb, const 
>> >> >> struct xt_policy_info *info,
>> >> >>
>> >> >>   return strict ? 1 : 0;
>> >> >>  }
>> >> >> +EXPORT_SYMBOL_GPL(xt_policy_match_policy_in);
>> >> >
>> >> > If you just want to call xt_policy_match from tc, then you could use
>> >> > tc ipt infrastructure instead.
>> >>
>> >> Thanks for the suggestion -
>> >> Are you referring to act_ipt? it looks like it allows calling targets;
>> >> I couldn't find a classifier calling a netfilter matcher.
>> >
>> > Then, I'd suggest you extend that infrastructure to alllow to call
>> > matches, so we reduce the number of interdepencies between different
>> > subsystems.
>>
>> This appears very versatile. though in this case the use of the xtables code 
>> and
>> structures was done in order to avoid introducing new uapi structures
>> and supporting
>> match code, not necessarily to expose the full capabilities of extended 
>> matches,
>> similar in spirit to what was done in the em_ipset ematch.
>>
>> Perhaps in order to avoid the direct export of xt_policy code, I could call
>> xt_request_find_match() from the em_policy module, requesting the
>> xt_policy match?
>> this way api exposure is minimized while not overly complicating the
>> scope of this feature.
>>
>> What do you think?
>
> That would look better indeed.
>
> But once you call xt_request_find_match() from there, how far is to
> allow any arbitrary match? I think you only have to specify the match
> name, family and the binary layout structure that represents
> xt_policy, right?
>

I don't think that should be a problem. I'd need to pass the protocol onto
the ematches .change() callbacks and get the appropriate match from there.

> I'm telling this, because I think it would be fair enough to me if you
> add the generic infrastructure to the kernel to allow arbitrary load
> of xt matches, and then from userspace you just add the code to
> support this which is what you need.
>
> Probably someone else - not you - may follow up later on to generalize
> the userspace codebase to support other matches, by when that happens,
> the right bits will be in the kernel already.

I'm fine with submitting the more generic infrastructure.
Will follow up with a new series.

Thanks again!
Eyal.


Re: [PATCH net-next 1/2] net: netfilter: export xt_policy match_policy_in() as xt_policy_match_policy_in()

2018-01-14 Thread Eyal Birger
On Fri, Jan 12, 2018 at 4:00 PM, Pablo Neira Ayuso  wrote:
> On Fri, Jan 12, 2018 at 03:56:21PM +0200, Eyal Birger wrote:
>> On Fri, Jan 12, 2018 at 3:41 PM, Pablo Neira Ayuso  
>> wrote:
>> > On Fri, Jan 12, 2018 at 02:57:24PM +0200, Eyal Birger wrote:
>> >> @@ -51,9 +52,9 @@ match_xfrm_state(const struct xfrm_state *x, const 
>> >> struct xt_policy_elem *e,
>> >>  MATCH(reqid, x->props.reqid);
>> >>  }
>> >>
>> >> -static int
>> >> -match_policy_in(const struct sk_buff *skb, const struct xt_policy_info 
>> >> *info,
>> >> - unsigned short family)
>> >> +int xt_policy_match_policy_in(const struct sk_buff *skb,
>> >> +   const struct xt_policy_info *info,
>> >> +   unsigned short family)
>> >>  {
>> >>   const struct xt_policy_elem *e;
>> >>   const struct sec_path *sp = skb->sp;
>> >> @@ -80,10 +81,11 @@ match_policy_in(const struct sk_buff *skb, const 
>> >> struct xt_policy_info *info,
>> >>
>> >>   return strict ? 1 : 0;
>> >>  }
>> >> +EXPORT_SYMBOL_GPL(xt_policy_match_policy_in);
>> >
>> > If you just want to call xt_policy_match from tc, then you could use
>> > tc ipt infrastructure instead.
>>
>> Thanks for the suggestion -
>> Are you referring to act_ipt? it looks like it allows calling targets;
>> I couldn't find a classifier calling a netfilter matcher.
>
> Then, I'd suggest you extend that infrastructure to alllow to call
> matches, so we reduce the number of interdepencies between different
> subsystems.

This appears very versatile. though in this case the use of the xtables code and
structures was done in order to avoid introducing new uapi structures
and supporting
match code, not necessarily to expose the full capabilities of extended matches,
similar in spirit to what was done in the em_ipset ematch.

Perhaps in order to avoid the direct export of xt_policy code, I could call
xt_request_find_match() from the em_policy module, requesting the
xt_policy match?
this way api exposure is minimized while not overly complicating the
scope of this feature.

What do you think?
Eyal.


Re: [PATCH net-next 1/2] net: netfilter: export xt_policy match_policy_in() as xt_policy_match_policy_in()

2018-01-12 Thread Eyal Birger
On Fri, Jan 12, 2018 at 3:41 PM, Pablo Neira Ayuso  wrote:
> On Fri, Jan 12, 2018 at 02:57:24PM +0200, Eyal Birger wrote:
>> @@ -51,9 +52,9 @@ match_xfrm_state(const struct xfrm_state *x, const struct 
>> xt_policy_elem *e,
>>  MATCH(reqid, x->props.reqid);
>>  }
>>
>> -static int
>> -match_policy_in(const struct sk_buff *skb, const struct xt_policy_info 
>> *info,
>> - unsigned short family)
>> +int xt_policy_match_policy_in(const struct sk_buff *skb,
>> +   const struct xt_policy_info *info,
>> +   unsigned short family)
>>  {
>>   const struct xt_policy_elem *e;
>>   const struct sec_path *sp = skb->sp;
>> @@ -80,10 +81,11 @@ match_policy_in(const struct sk_buff *skb, const struct 
>> xt_policy_info *info,
>>
>>   return strict ? 1 : 0;
>>  }
>> +EXPORT_SYMBOL_GPL(xt_policy_match_policy_in);
>
> If you just want to call xt_policy_match from tc, then you could use
> tc ipt infrastructure instead.

Thanks for the suggestion -
Are you referring to act_ipt? it looks like it allows calling targets;
I couldn't
find a classifier calling a netfilter matcher.

Eyal.


[PATCH net-next 2/2] net: sched: add xfrm policy ematch

2018-01-12 Thread Eyal Birger
From: Eyal Birger 

Allows classification based on the incoming IPSec policy used during
decpsulation.

This allows similar matching capabilities to those provided by netfilter
xt_policy module, and uses the same data strcuture - but from a tc entry
point.

Signed-off-by: Eyal Birger 
---
 include/uapi/linux/pkt_cls.h |   3 +-
 net/sched/Kconfig|  10 
 net/sched/Makefile   |   1 +
 net/sched/em_policy.c| 117 +++
 4 files changed, 130 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/em_policy.c

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 46c5066..963842c 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -555,7 +555,8 @@ enum {
 #defineTCF_EM_VLAN 6
 #defineTCF_EM_CANID7
 #defineTCF_EM_IPSET8
-#defineTCF_EM_MAX  8
+#defineTCF_EM_POLICY   9
+#defineTCF_EM_MAX  9
 
 enum {
TCF_EM_PROG_TC
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index c03d86a..0670f53 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -658,6 +658,16 @@ config NET_EMATCH_IPSET
  To compile this code as a module, choose M here: the
  module will be called em_ipset.
 
+config NET_EMATCH_POLICY
+   tristate "Policy"
+   depends on NET_EMATCH && NETFILTER_XT_MATCH_POLICY
+   ---help---
+ Say Y here if you want to be able to classify packets based on
+ IPsec policy that was used during decapsulation
+
+ To compile this code as a module, choose M here: the
+ module will be called em_policy.
+
 config NET_CLS_ACT
bool "Actions"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5b63544..7ca02a1 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -75,3 +75,4 @@ obj-$(CONFIG_NET_EMATCH_META) += em_meta.o
 obj-$(CONFIG_NET_EMATCH_TEXT)  += em_text.o
 obj-$(CONFIG_NET_EMATCH_CANID) += em_canid.o
 obj-$(CONFIG_NET_EMATCH_IPSET) += em_ipset.o
+obj-$(CONFIG_NET_EMATCH_POLICY)+= em_policy.o
diff --git a/net/sched/em_policy.c b/net/sched/em_policy.c
new file mode 100644
index 000..94ef318
--- /dev/null
+++ b/net/sched/em_policy.c
@@ -0,0 +1,117 @@
+/*
+ * net/sched/em_policy.c IPSec Policy Ematch
+ *
+ * (c) 2018 Eyal Birger 
+ *
+ * Parts taken from netfilter/xt_policy.h:
+ * Copyright (c) 2004,2005 Patrick McHardy, 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int em_policy_change(struct net *net, void *data, int data_len,
+   struct tcf_ematch *em)
+{
+   const struct xt_policy_info *info = (const void *)data;
+   __u16 dir_flags;
+
+   if (data_len != sizeof(*info))
+   return -EINVAL;
+
+   if (info->len > XT_POLICY_MAX_ELEM) {
+   pr_info("too many policy elements\n");
+   return -EINVAL;
+   }
+
+   dir_flags = info->flags & (XT_POLICY_MATCH_IN | XT_POLICY_MATCH_OUT);
+   if (dir_flags != XT_POLICY_MATCH_IN) {
+   pr_info("Only incoming policy can be matched\n");
+   return -EINVAL;
+   }
+
+   em->datalen = sizeof(*info);
+   em->data = (unsigned long)kmemdup(data, em->datalen, GFP_KERNEL);
+   if (!em->data)
+   return -ENOMEM;
+
+   return 0;
+}
+
+static void em_policy_destroy(struct tcf_ematch *em)
+{
+   const struct xt_policy_info *info = (const void *)em->data;
+
+   if (!info)
+   return;
+
+   kfree((void *)em->data);
+}
+
+static int em_policy_match(struct sk_buff *skb, struct tcf_ematch *em,
+  struct tcf_pkt_info *info)
+{
+   const struct xt_policy_info *pol = (const void *)em->data;
+   unsigned short pf;
+   int ret;
+
+   switch (tc_skb_protocol(skb)) {
+   case htons(ETH_P_IP):
+   pf = NFPROTO_IPV4;
+   break;
+   case htons(ETH_P_IPV6):
+   pf = NFPROTO_IPV6;
+   break;
+   default:
+   return false;
+   }
+
+   ret = xt_policy_match_policy_in(skb, pol, pf);
+   if (ret < 0)
+   ret = pol->flags & XT_POLICY_MATCH_NONE ? true : false;
+   else if (pol->flags & XT_POLICY_MATCH_NONE)
+   ret = false;
+
+   return ret;
+}
+
+static struct tcf_ematch_ops em_policy_ops = {
+   .kind = TCF_EM_POLICY,
+   .change   = em_policy_change,
+   .destroy  = em_policy_destroy,
+

[PATCH net-next 0/2] net: sched: Introduce em_policy ematch

2018-01-12 Thread Eyal Birger
From: Eyal Birger 

The following patchset introduces a new tc ematch for matching IPSec
traffic from a tc context.

This allows early classification as well as mirroning/redirecting IPSec
traffic based on decapsulation criteria.

The matching functionality is based on the netfilter xt_policy match, and
shares code and data structures.

Eyal Birger (2):
  net: netfilter: export xt_policy match_policy_in() as
xt_policy_match_policy_in()
  net: sched: add xfrm policy ematch

 include/net/netfilter/xt_policy.h |  12 
 include/uapi/linux/pkt_cls.h  |   3 +-
 net/netfilter/xt_policy.c |  18 +++---
 net/sched/Kconfig |  10 
 net/sched/Makefile|   1 +
 net/sched/em_policy.c | 117 ++
 6 files changed, 152 insertions(+), 9 deletions(-)
 create mode 100644 include/net/netfilter/xt_policy.h
 create mode 100644 net/sched/em_policy.c

-- 
2.7.4



[PATCH net-next 1/2] net: netfilter: export xt_policy match_policy_in() as xt_policy_match_policy_in()

2018-01-12 Thread Eyal Birger
From: Eyal Birger 

Expose this functionality so it could be usable from a tc classifier.

The rename of match_policy_out() is done for consistency though it is not
exported.

Signed-off-by: Eyal Birger 
---
 include/net/netfilter/xt_policy.h | 12 
 net/netfilter/xt_policy.c | 18 ++
 2 files changed, 22 insertions(+), 8 deletions(-)
 create mode 100644 include/net/netfilter/xt_policy.h

diff --git a/include/net/netfilter/xt_policy.h 
b/include/net/netfilter/xt_policy.h
new file mode 100644
index 000..99dcd57
--- /dev/null
+++ b/include/net/netfilter/xt_policy.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _XT_POLICY_INT_H
+#define _XT_POLICY_INT_H
+
+#include 
+#include 
+
+int xt_policy_match_policy_in(const struct sk_buff *skb,
+ const struct xt_policy_info *info,
+ unsigned short family);
+
+#endif /* _XT_POLICY_INT_H */
diff --git a/net/netfilter/xt_policy.c b/net/netfilter/xt_policy.c
index 5639fb0..4f9d0b1 100644
--- a/net/netfilter/xt_policy.c
+++ b/net/netfilter/xt_policy.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 MODULE_AUTHOR("Patrick McHardy ");
 MODULE_DESCRIPTION("Xtables: IPsec policy match");
@@ -51,9 +52,9 @@ match_xfrm_state(const struct xfrm_state *x, const struct 
xt_policy_elem *e,
   MATCH(reqid, x->props.reqid);
 }
 
-static int
-match_policy_in(const struct sk_buff *skb, const struct xt_policy_info *info,
-   unsigned short family)
+int xt_policy_match_policy_in(const struct sk_buff *skb,
+ const struct xt_policy_info *info,
+ unsigned short family)
 {
const struct xt_policy_elem *e;
const struct sec_path *sp = skb->sp;
@@ -80,10 +81,11 @@ match_policy_in(const struct sk_buff *skb, const struct 
xt_policy_info *info,
 
return strict ? 1 : 0;
 }
+EXPORT_SYMBOL_GPL(xt_policy_match_policy_in);
 
-static int
-match_policy_out(const struct sk_buff *skb, const struct xt_policy_info *info,
-unsigned short family)
+static int xt_policy_match_policy_out(const struct sk_buff *skb,
+ const struct xt_policy_info *info,
+ unsigned short family)
 {
const struct xt_policy_elem *e;
const struct dst_entry *dst = skb_dst(skb);
@@ -117,9 +119,9 @@ policy_mt(const struct sk_buff *skb, struct xt_action_param 
*par)
int ret;
 
if (info->flags & XT_POLICY_MATCH_IN)
-   ret = match_policy_in(skb, info, xt_family(par));
+   ret = xt_policy_match_policy_in(skb, info, xt_family(par));
else
-   ret = match_policy_out(skb, info, xt_family(par));
+   ret = xt_policy_match_policy_out(skb, info, xt_family(par));
 
if (ret < 0)
ret = info->flags & XT_POLICY_MATCH_NONE ? true : false;
-- 
2.7.4



Re: [PATCH net] net: fib_rules: Fix fib_rules_ops->compare implementations to support exact match

2017-10-03 Thread Eyal Birger
Hi David,

On Wed, Oct 4, 2017 at 12:54 AM, David Miller  wrote:
> From: Shmulik Ladkani 
> Date: Sat, 30 Sep 2017 11:59:09 +0300
>
>> This leads to inconsistencies, depending on order of operations, e.g.:
>
> I don't see any inconsistency.  When you insert using NLM_F_EXCL the
> insertion fails if any existing rule matches or overlaps in any way
> with the keys in the new rule.
>
> Sorry I'm not going to apply this.

The inconsistency we saw is that 0.0.0.0/0 is treated differently compared to
all other subnets - for which overlaps are not disallowed - e.g. this succeeds:

# ip ru add from 10.0.0.0/8 iif eth2 pref 33 table 33
# ip ru add from 0.0.0.0/1 iif eth2 pref 33 table 33
# ip ru add from 128.0.0.0/1 iif eth2 pref 33 table 33

Though being functionally equivalent to adding from=0.0.0.0/0.

So our understanding was that 'different subnet==different rule', similar to the
route addition behavior with NLM_F_EXCL.

Best regards,
Eyal.


Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Eyal Birger
On Wed, Oct 5, 2016 at 8:23 PM, Jiri Benc  wrote:
> On Wed, 5 Oct 2016 17:18:08 +0300, Eyal Birger wrote:
>> I think at this point, 'eth' may point to a freed packet.
>
> It may but how does that matter? eth is not used beyond that point.

Definitely a nit. For sure not critical.

Just seemed less future safe to keep a pointer to an old packet lying around.

Eyal.


Re: [PATCH net-next v2 1/3] openvswitch: normalize vlan rx path

2016-10-05 Thread Eyal Birger
Hi,

On Wed, Oct 5, 2016 at 4:07 PM, Jiri Benc  wrote:
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 4d67ea856067..c47b3da8ecf2 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -594,6 +594,16 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
> struct genl_info *info)
> else
> packet->protocol = htons(ETH_P_802_2);
>
> +   if (eth_type_vlan(packet->protocol)) {
> +   __skb_pull(packet, ETH_HLEN);
> +   skb_reset_network_header(packet);
> +   skb_reset_mac_len(packet);
> +   packet = skb_vlan_untag(packet);
> +   if (unlikely(!packet))
> +   goto err;

I think at this point, 'eth' may point to a freed packet.
Maybe replace the 'eth' variable with a 'proto' variable caching the
value of eth_hdr(packet)->h_proto?

> +   skb_push(packet, ETH_HLEN);
> +   }
> +
> /* Set packet's mru */
> if (a[OVS_PACKET_ATTR_MRU]) {
> mru = nla_get_u16(a[OVS_PACKET_ATTR_MRU]);
> --
> 1.8.3.1
>

Eyal.