Re: [PATCH 2/2] openvswitch: Add eventmask support to CT action.
Thanks Joe, I’ll issue a v2 with the comment fix and retain the acks, so it should be good to go. Jarno > On Apr 20, 2017, at 11:53 AM, Joe Stringer <j...@ovn.org> wrote: > > On 19 April 2017 at 18:49, Jarno Rajahalme <ja...@ovn.org> wrote: >> Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK, >> which can be used in conjunction with the commit flag >> (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which >> conntrack events (IPCT_*) should be delivered via the Netfilter >> netlink multicast groups. Default behavior depends on the system >> configuration, but typically a lot of events are delivered. This can be >> very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some >> types of events are of interest. >> >> Netfilter core init_conntrack() adds the event cache extension, so we >> only need to set the ctmask value. However, if the system is >> configured without support for events, the setting will be skipped due >> to extension not being found. >> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> >> --- > > Thanks Jarno, looks good overall. Minor nit in the API description below. > > Acked-by: Joe Stringer <j...@ovn.org> > >> include/uapi/linux/openvswitch.h | 12 >> net/openvswitch/conntrack.c | 27 +++ >> 2 files changed, 39 insertions(+) >> >> diff --git a/include/uapi/linux/openvswitch.h >> b/include/uapi/linux/openvswitch.h >> index 66d1c3c..38ae95d 100644 >> --- a/include/uapi/linux/openvswitch.h >> +++ b/include/uapi/linux/openvswitch.h >> @@ -693,6 +693,17 @@ struct ovs_action_hash { >> * nothing if the connection is already committed will check that the current >> * packet is in conntrack entry's original direction. If directionality does >> * not match, will delete the existing conntrack entry and commit a new one. >> + * @OVS_CT_ATTR_EVENTMASK: Mask of bits indicating which conntrack event >> types >> + * (enum ip_conntrack_events IPCT_*) should be reported. For any bit set to >> + * zero, the corresponding event type is not generated. Default behavior >> + * depends on system configuration, but typically all event types are >> + * generated, hence listening on UPDATE events may get a lot of events. > > s/UPDATE/NFNLGRP_CONNTRACK_UPDATE/ > >> + * Explicitly passing this attribute allows limiting the updates received to >> + * the events of interest. The bit 1 << IPCT_NEW, 1 << IPCT_RELATED, and >> + * 1 << IPCT_DESTROY must be set to ones for those events to be received on >> + * NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups, respectively. >> + * Remaining bits control the changes for which an event is delivered on the >> + * NFNLGRP_CONNTRACK_UPDATE group. >> */ >> enum ovs_ct_attr { >>OVS_CT_ATTR_UNSPEC, >> @@ -704,6 +715,7 @@ enum ovs_ct_attr { >> related connections. */ >>OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ >>OVS_CT_ATTR_FORCE_COMMIT, /* No argument */ >> + OVS_CT_ATTR_EVENTMASK, /* u32 mask of IPCT_* events. */ >>__OVS_CT_ATTR_MAX >> }; >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c >> index 58de4c2..4f7c3b5 100644 >> --- a/net/openvswitch/conntrack.c >> +++ b/net/openvswitch/conntrack.c >> @@ -66,7 +66,9 @@ struct ovs_conntrack_info { >>u8 commit : 1; >>u8 nat : 3; /* enum ovs_ct_nat */ >>u8 force : 1; >> + u8 have_eventmask : 1; >>u16 family; >> + u32 eventmask; /* Mask of 1 << IPCT_*. */ >>struct md_mark mark; >>struct md_labels labels; >> #ifdef CONFIG_NF_NAT_NEEDED >> @@ -1007,6 +1009,20 @@ static int ovs_ct_commit(struct net *net, struct >> sw_flow_key *key, >>if (!ct) >>return 0; >> >> + /* Set the conntrack event mask if given. NEW and DELETE events have >> +* their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener >> +* typically would receive many kinds of updates. Setting the event >> +* mask allows those events to be filtered. The set event mask will >> +* remain in effect for the lifetime of the connection unless changed >> +* by a further CT action with both the commit flag and the eventmask >> +* option. */ >> + if (info->have_eventmask) { >> + struct nf_conntrack_ecache *cache
[PATCH net-next v2 2/2] openvswitch: Add eventmask support to CT action.
Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK, which can be used in conjunction with the commit flag (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which conntrack events (IPCT_*) should be delivered via the Netfilter netlink multicast groups. Default behavior depends on the system configuration, but typically a lot of events are delivered. This can be very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some types of events are of interest. Netfilter core init_conntrack() adds the event cache extension, so we only need to set the ctmask value. However, if the system is configured without support for events, the setting will be skipped due to extension not being found. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Reviewed-by: Greg Rose <gvrose8...@gmail.com> Acked-by: Joe Stringer <j...@ovn.org> --- include/uapi/linux/openvswitch.h | 12 net/openvswitch/conntrack.c | 27 +++ 2 files changed, 39 insertions(+) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 66d1c3c..61b7d36 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -693,6 +693,17 @@ struct ovs_action_hash { * nothing if the connection is already committed will check that the current * packet is in conntrack entry's original direction. If directionality does * not match, will delete the existing conntrack entry and commit a new one. + * @OVS_CT_ATTR_EVENTMASK: Mask of bits indicating which conntrack event types + * (enum ip_conntrack_events IPCT_*) should be reported. For any bit set to + * zero, the corresponding event type is not generated. Default behavior + * depends on system configuration, but typically all event types are + * generated, hence listening on NFNLGRP_CONNTRACK_UPDATE events may get a lot + * of events. Explicitly passing this attribute allows limiting the updates + * received to the events of interest. The bit 1 << IPCT_NEW, 1 << + * IPCT_RELATED, and 1 << IPCT_DESTROY must be set to ones for those events to + * be received on NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups, + * respectively. Remaining bits control the changes for which an event is + * delivered on the NFNLGRP_CONNTRACK_UPDATE group. */ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -704,6 +715,7 @@ enum ovs_ct_attr { related connections. */ OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ OVS_CT_ATTR_FORCE_COMMIT, /* No argument */ + OVS_CT_ATTR_EVENTMASK, /* u32 mask of IPCT_* events. */ __OVS_CT_ATTR_MAX }; diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 58de4c2..4f7c3b5 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -66,7 +66,9 @@ struct ovs_conntrack_info { u8 commit : 1; u8 nat : 3; /* enum ovs_ct_nat */ u8 force : 1; + u8 have_eventmask : 1; u16 family; + u32 eventmask; /* Mask of 1 << IPCT_*. */ struct md_mark mark; struct md_labels labels; #ifdef CONFIG_NF_NAT_NEEDED @@ -1007,6 +1009,20 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, if (!ct) return 0; + /* Set the conntrack event mask if given. NEW and DELETE events have +* their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener +* typically would receive many kinds of updates. Setting the event +* mask allows those events to be filtered. The set event mask will +* remain in effect for the lifetime of the connection unless changed +* by a further CT action with both the commit flag and the eventmask +* option. */ + if (info->have_eventmask) { + struct nf_conntrack_ecache *cache = nf_ct_ecache_find(ct); + + if (cache) + cache->ctmask = info->eventmask; + } + /* Apply changes before confirming the connection so that the initial * conntrack NEW netlink event carries the values given in the CT * action. @@ -1238,6 +1254,8 @@ static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = { /* NAT length is checked when parsing the nested attributes. */ [OVS_CT_ATTR_NAT] = { .minlen = 0, .maxlen = INT_MAX }, #endif + [OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32), + .maxlen = sizeof(u32) }, }; static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, @@ -1316,6 +1334,11 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, break; } #endif + case OVS_CT_ATTR_EVENTMASK: + info->have_eventmask = true; + info->
[PATCH net-next v2 1/2] openvswitch: Typo fix.
Fix typo in a comment. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Greg Rose <gvrose8...@gmail.com> --- net/openvswitch/conntrack.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 7b2c2fc..58de4c2 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -373,7 +373,7 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, } /* Labels are included in the IPCTNL_MSG_CT_NEW event only if the -* IPCT_LABEL bit it set in the event cache. +* IPCT_LABEL bit is set in the event cache. */ nf_conntrack_event_cache(IPCT_LABEL, ct); -- 2.1.4
Re: [PATCH 1/2] openvswitch: Typo fix.
Sorry for the chatter, forgot to include “net-next” in the title, sending again. Jarno > On Apr 19, 2017, at 6:49 PM, Jarno Rajahalme <ja...@ovn.org> wrote: > > Fix typo in a comment. > > Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > --- > net/openvswitch/conntrack.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c > index 7b2c2fc..58de4c2 100644 > --- a/net/openvswitch/conntrack.c > +++ b/net/openvswitch/conntrack.c > @@ -373,7 +373,7 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct > sw_flow_key *key, > } > > /* Labels are included in the IPCTNL_MSG_CT_NEW event only if the > - * IPCT_LABEL bit it set in the event cache. > + * IPCT_LABEL bit is set in the event cache. >*/ > nf_conntrack_event_cache(IPCT_LABEL, ct); > > -- > 2.1.4 >
[PATCH net-next 1/2] openvswitch: Typo fix.
Fix typo in a comment. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 7b2c2fc..58de4c2 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -373,7 +373,7 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, } /* Labels are included in the IPCTNL_MSG_CT_NEW event only if the -* IPCT_LABEL bit it set in the event cache. +* IPCT_LABEL bit is set in the event cache. */ nf_conntrack_event_cache(IPCT_LABEL, ct); -- 2.1.4
[PATCH 2/2] openvswitch: Add eventmask support to CT action.
Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK, which can be used in conjunction with the commit flag (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which conntrack events (IPCT_*) should be delivered via the Netfilter netlink multicast groups. Default behavior depends on the system configuration, but typically a lot of events are delivered. This can be very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some types of events are of interest. Netfilter core init_conntrack() adds the event cache extension, so we only need to set the ctmask value. However, if the system is configured without support for events, the setting will be skipped due to extension not being found. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 12 net/openvswitch/conntrack.c | 27 +++ 2 files changed, 39 insertions(+) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 66d1c3c..38ae95d 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -693,6 +693,17 @@ struct ovs_action_hash { * nothing if the connection is already committed will check that the current * packet is in conntrack entry's original direction. If directionality does * not match, will delete the existing conntrack entry and commit a new one. + * @OVS_CT_ATTR_EVENTMASK: Mask of bits indicating which conntrack event types + * (enum ip_conntrack_events IPCT_*) should be reported. For any bit set to + * zero, the corresponding event type is not generated. Default behavior + * depends on system configuration, but typically all event types are + * generated, hence listening on UPDATE events may get a lot of events. + * Explicitly passing this attribute allows limiting the updates received to + * the events of interest. The bit 1 << IPCT_NEW, 1 << IPCT_RELATED, and + * 1 << IPCT_DESTROY must be set to ones for those events to be received on + * NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups, respectively. + * Remaining bits control the changes for which an event is delivered on the + * NFNLGRP_CONNTRACK_UPDATE group. */ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -704,6 +715,7 @@ enum ovs_ct_attr { related connections. */ OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ OVS_CT_ATTR_FORCE_COMMIT, /* No argument */ + OVS_CT_ATTR_EVENTMASK, /* u32 mask of IPCT_* events. */ __OVS_CT_ATTR_MAX }; diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 58de4c2..4f7c3b5 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -66,7 +66,9 @@ struct ovs_conntrack_info { u8 commit : 1; u8 nat : 3; /* enum ovs_ct_nat */ u8 force : 1; + u8 have_eventmask : 1; u16 family; + u32 eventmask; /* Mask of 1 << IPCT_*. */ struct md_mark mark; struct md_labels labels; #ifdef CONFIG_NF_NAT_NEEDED @@ -1007,6 +1009,20 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, if (!ct) return 0; + /* Set the conntrack event mask if given. NEW and DELETE events have +* their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener +* typically would receive many kinds of updates. Setting the event +* mask allows those events to be filtered. The set event mask will +* remain in effect for the lifetime of the connection unless changed +* by a further CT action with both the commit flag and the eventmask +* option. */ + if (info->have_eventmask) { + struct nf_conntrack_ecache *cache = nf_ct_ecache_find(ct); + + if (cache) + cache->ctmask = info->eventmask; + } + /* Apply changes before confirming the connection so that the initial * conntrack NEW netlink event carries the values given in the CT * action. @@ -1238,6 +1254,8 @@ static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = { /* NAT length is checked when parsing the nested attributes. */ [OVS_CT_ATTR_NAT] = { .minlen = 0, .maxlen = INT_MAX }, #endif + [OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32), + .maxlen = sizeof(u32) }, }; static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, @@ -1316,6 +1334,11 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, break; } #endif + case OVS_CT_ATTR_EVENTMASK: + info->have_eventmask = true; + info->eventmask = nla_get_u32(a); + break; + default:
[PATCH 1/2] openvswitch: Typo fix.
Fix typo in a comment. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 7b2c2fc..58de4c2 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -373,7 +373,7 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, } /* Labels are included in the IPCTNL_MSG_CT_NEW event only if the -* IPCT_LABEL bit it set in the event cache. +* IPCT_LABEL bit is set in the event cache. */ nf_conntrack_event_cache(IPCT_LABEL, ct); -- 2.1.4
[PATCH net-next 2/2] openvswitch: Add eventmask support to CT action.
Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK, which can be used in conjunction with the commit flag (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which conntrack events (IPCT_*) should be delivered via the Netfilter netlink multicast groups. Default behavior depends on the system configuration, but typically a lot of events are delivered. This can be very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some types of events are of interest. Netfilter core init_conntrack() adds the event cache extension, so we only need to set the ctmask value. However, if the system is configured without support for events, the setting will be skipped due to extension not being found. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 12 net/openvswitch/conntrack.c | 27 +++ 2 files changed, 39 insertions(+) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 66d1c3c..38ae95d 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -693,6 +693,17 @@ struct ovs_action_hash { * nothing if the connection is already committed will check that the current * packet is in conntrack entry's original direction. If directionality does * not match, will delete the existing conntrack entry and commit a new one. + * @OVS_CT_ATTR_EVENTMASK: Mask of bits indicating which conntrack event types + * (enum ip_conntrack_events IPCT_*) should be reported. For any bit set to + * zero, the corresponding event type is not generated. Default behavior + * depends on system configuration, but typically all event types are + * generated, hence listening on UPDATE events may get a lot of events. + * Explicitly passing this attribute allows limiting the updates received to + * the events of interest. The bit 1 << IPCT_NEW, 1 << IPCT_RELATED, and + * 1 << IPCT_DESTROY must be set to ones for those events to be received on + * NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups, respectively. + * Remaining bits control the changes for which an event is delivered on the + * NFNLGRP_CONNTRACK_UPDATE group. */ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -704,6 +715,7 @@ enum ovs_ct_attr { related connections. */ OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ OVS_CT_ATTR_FORCE_COMMIT, /* No argument */ + OVS_CT_ATTR_EVENTMASK, /* u32 mask of IPCT_* events. */ __OVS_CT_ATTR_MAX }; diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 58de4c2..4f7c3b5 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -66,7 +66,9 @@ struct ovs_conntrack_info { u8 commit : 1; u8 nat : 3; /* enum ovs_ct_nat */ u8 force : 1; + u8 have_eventmask : 1; u16 family; + u32 eventmask; /* Mask of 1 << IPCT_*. */ struct md_mark mark; struct md_labels labels; #ifdef CONFIG_NF_NAT_NEEDED @@ -1007,6 +1009,20 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, if (!ct) return 0; + /* Set the conntrack event mask if given. NEW and DELETE events have +* their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener +* typically would receive many kinds of updates. Setting the event +* mask allows those events to be filtered. The set event mask will +* remain in effect for the lifetime of the connection unless changed +* by a further CT action with both the commit flag and the eventmask +* option. */ + if (info->have_eventmask) { + struct nf_conntrack_ecache *cache = nf_ct_ecache_find(ct); + + if (cache) + cache->ctmask = info->eventmask; + } + /* Apply changes before confirming the connection so that the initial * conntrack NEW netlink event carries the values given in the CT * action. @@ -1238,6 +1254,8 @@ static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = { /* NAT length is checked when parsing the nested attributes. */ [OVS_CT_ATTR_NAT] = { .minlen = 0, .maxlen = INT_MAX }, #endif + [OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32), + .maxlen = sizeof(u32) }, }; static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, @@ -1316,6 +1334,11 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, break; } #endif + case OVS_CT_ATTR_EVENTMASK: + info->have_eventmask = true; + info->eventmask = nla_get_u32(a); + break; + default:
[PATCH net] openvswitch: Fix refcount leak on force commit.
The reference count held for skb needs to be released when the skb's nfct pointer is cleared regardless of if nf_ct_delete() is called or not. Failing to release the skb's reference cound led to deferred conntrack cleanup spinning forever within nf_conntrack_cleanup_net_list() when cleaning up a network namespace: kworker/u16:0-19025 [004] 45981067.173642: sched_switch: kworker/u16:0:19025 [120] R ==> rcu_preempt:7 [120] kworker/u16:0-19025 [004] 45981067.173651: kernel_stack: => ___preempt_schedule (a001ed36) => _raw_spin_unlock_bh (a0713290) => nf_ct_iterate_cleanup (c00a4454) => nf_conntrack_cleanup_net_list (c00a5e1e) => nf_conntrack_pernet_exit (c00a63dd) => ops_exit_list.isra.1 (a06075f3) => cleanup_net (a0607df0) => process_one_work (a0084c31) => worker_thread (a008592b) => kthread (a008bee2) => ret_from_fork (a071b67c) Fixes: dd41d33f0b03 ("openvswitch: Add force commit.") Reported-by: Yang Song <yangs...@vmware.com> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index e0a8777..7b2c2fc 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -643,8 +643,8 @@ static bool skb_nfct_cached(struct net *net, */ if (nf_ct_is_confirmed(ct)) nf_ct_delete(ct, 0, 0); - else - nf_conntrack_put(>ct_general); + + nf_conntrack_put(>ct_general); nf_ct_set(skb, NULL, 0); return false; } -- 2.1.4
[PATCH net-next 2/2] netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value.
Commit 4dee62b1 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") inadvertently changed the successful return value of nf_ct_expect_related_report() from 0 to 1 due to __nf_ct_expect_check() returning 1 on success. Prevent this regression in the future by changing the return value of __nf_ct_expect_check() to 0 on success. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/netfilter/nf_conntrack_expect.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index d6ace69..4b2e1fb 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -410,7 +410,7 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect) struct net *net = nf_ct_exp_net(expect); struct hlist_node *next; unsigned int h; - int ret = 1; + int ret = 0; if (!master_help) { ret = -ESHUTDOWN; @@ -460,7 +460,7 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, spin_lock_bh(_conntrack_expect_lock); ret = __nf_ct_expect_check(expect); - if (ret <= 0) + if (ret < 0) goto out; nf_ct_expect_insert(expect); -- 2.1.4
[PATCH net-next 1/2] netfilter: nf_ct_expect: nf_ct_expect_related_report(): Return zero on success,
Commit 4dee62b1 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") inadvertently changed the successful return value of nf_ct_expect_related_report() from 0 to 1, which caused openvswitch conntrack integration fail in FTP test cases. Fix this by always returning zero on the success code path. Fixes: 4dee62b1 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/netfilter/nf_conntrack_expect.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index e19a697..d6ace69 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -467,7 +467,7 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, spin_unlock_bh(_conntrack_expect_lock); nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report); - return ret; + return 0; out: spin_unlock_bh(_conntrack_expect_lock); return ret; -- 2.1.4
[PATCH net-next] openvswitch: Set event bit after initializing labels.
Connlabels are included in conntrack netlink event messages only if the IPCT_LABEL bit is set in the event cache (see ctnetlink_conntrack_event()). Set it after initializing labels for a new connection. Found upon further system testing, where it was noticed that labels were missing from the conntrack events. Fixes: 193e30967897 ("openvswitch: Do not trigger events for unconfirmed connections.") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index c2d452e..85cd595 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -339,9 +339,7 @@ static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct) /* Initialize labels for a new, yet to be committed conntrack entry. Note that * since the new connection is not yet confirmed, and thus no-one else has - * access to it's labels, we simply write them over. Also, we refrain from - * triggering events, as receiving change events before the create event would - * be confusing. + * access to it's labels, we simply write them over. */ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, const struct ovs_key_ct_labels *labels, @@ -374,6 +372,11 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, & mask->ct_labels_32[i]); } + /* Labels are included in the IPCTNL_MSG_CT_NEW event only if the +* IPCT_LABEL bit it set in the event cache. +*/ + nf_conntrack_event_cache(IPCT_LABEL, ct); + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); return 0; -- 2.1.4
[PATCH net-next] openvswitch: Set internal device max mtu to ETH_MAX_MTU.
Commit 91572088e3fd ("net: use core MTU range checking in core net infra") changed the openvswitch internal device to use the core net infra for controlling the MTU range, but failed to actually set the max_mtu as described in the commit message, which now defaults to ETH_DATA_LEN. This patch fixes this by setting max_mtu to ETH_MAX_MTU after ether_setup() call. Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/vport-internal_dev.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c index 09141a1..89193a6 100644 --- a/net/openvswitch/vport-internal_dev.c +++ b/net/openvswitch/vport-internal_dev.c @@ -149,6 +149,8 @@ static void do_setup(struct net_device *netdev) { ether_setup(netdev); + netdev->max_mtu = ETH_MAX_MTU; + netdev->netdev_ops = _dev_netdev_ops; netdev->priv_flags &= ~IFF_TX_SKB_SHARING; -- 2.1.4
Re: [PATCH v3 net-next 00/10] openvswitch: Conntrack integration improvements.
> On Feb 9, 2017, at 8:44 AM, Pravin Shelar <pshe...@ovn.org> wrote: > > On Wed, Feb 8, 2017 at 5:30 PM, Jarno Rajahalme <ja...@ovn.org> wrote: >> This series improves the conntrack integration code in the openvswitch >> module by fixing outdated comments (patch 1), bugs (patches 2, 3, and >> 7), clarifying code (patches 4, 5, and 6), improving performance >> (patch 10), and adding new features enabling better translation from >> firewall admission policy to network configuration requested by user >> communities (patches 8 and 9). >> >> v3: Rebase to the current net-next, add the comment only changing >>patch 1 and reshuffle some of the patches as requested by Joe. >> > > All patches looks good to me. > > Acked-by: Pravin B Shelar <pshe...@ovn.org> > > Thanks. Thanks for the review! Joe spotted some english language and coding style issues I had not addressed in v3, so I just posted a v4. I included Joes and yours Acks, but not sure if that is the proper protocol, or if you need to Ack the v4 again. Sorry for the inconvenience! Jarno
Re: [PATCH v2 net-next 3/9] openvswitch: Simplify labels length logic.
> On Feb 8, 2017, at 2:47 PM, Joe Stringer <j...@ovn.org> wrote: > > On 8 February 2017 at 11:32, Jarno Rajahalme <ja...@ovn.org> wrote: >> Since 23014011ba42 ("netfilter: conntrack: support a fixed size of 128 >> distinct labels"), the size of conntrack labels extension has fixed to >> 128 bits, so we do not need to check for labels sizes shorter than 128 >> at run-time. This patch simplifies labels length logic accordingly, >> but allows the conntrack labels size to be increased in the future >> without breaking the build. In the event of conntrack labels >> increasing in size OVS would still be able to deal with the 128 first >> label bits. >> >> Suggested-by: Joe Stringer <j...@ovn.org> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> >> --- >> net/openvswitch/conntrack.c | 22 +++--- >> 1 file changed, 11 insertions(+), 11 deletions(-) >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c >> index 6730f09..a07e5cd 100644 >> --- a/net/openvswitch/conntrack.c >> +++ b/net/openvswitch/conntrack.c >> @@ -129,22 +129,22 @@ static u32 ovs_ct_get_mark(const struct nf_conn *ct) >> #endif >> } >> >> +/* Guard against conntrack labels max size shrinking below 128 bits. */ >> +#if NF_CT_LABELS_MAX_SIZE < 16 >> +#error NF_CT_LABELS_MAX_SIZE must be at least 16 bytes >> +#endif >> + >> static void ovs_ct_get_labels(const struct nf_conn *ct, >> struct ovs_key_ct_labels *labels) >> { >>struct nf_conn_labels *cl = ct ? nf_ct_labels_find(ct) : NULL; >> >> - if (cl) { >> - size_t len = sizeof(cl->bits); >> - >> - if (len > OVS_CT_LABELS_LEN) >> - len = OVS_CT_LABELS_LEN; >> - else if (len < OVS_CT_LABELS_LEN) >> - memset(labels, 0, OVS_CT_LABELS_LEN); >> - memcpy(labels, cl->bits, len); >> - } else { >> + if (cl) >> + memcpy(labels, cl->bits, >> + sizeof(cl->bits) > OVS_CT_LABELS_LEN >> + ? OVS_CT_LABELS_LEN : sizeof(cl->bits)); > > Is this to be defensive? If sizeof(cl->bits) is larger than > OVS_CT_LABELS_LEN, we'll use OVS_CT_LABELS_LEN; if it's equal, we'll > still use OVS_CT_LABELS_LEN; if it's less, the precompiler will fail > above. Why not memcpy(.., OVS_CT_LABELS_LEN) ? Right, I’ll post v4 simplifying this as suggested. Functionality is still the same, though. Jarno
Re: [PATCH v2 net-next 8/9] openvswitch: Add force commit.
> On Feb 8, 2017, at 3:53 PM, Joe Stringer <j...@ovn.org> wrote: > > On 8 February 2017 at 11:32, Jarno Rajahalme <ja...@ovn.org> wrote: >> Stateful network admission policy may allow connections to one >> direction and reject connections initiated in the other direction. >> After policy change it is possible that for a new connection an >> overlapping conntrack entry already exists, where the original >> direction of the existing connection is opposed to the new >> connection's initial packet. >> >> Most importantly, conntrack state relating to the current packet gets >> the "reply" designation based on whether the original direction tuple >> or the reply direction tuple matched. If this "directionality" is >> wrong w.r.t. to the stateful network admission policy it may happen >> that packets in neither direction are correctly admitted. >> >> This patch adds a new "force commit" option to the OVS conntrack >> action that checks the original direction of an existing conntrack >> entry. If that direction is opposed to the current packet, the >> existing conntrack entry is deleted and a new one is subsequently >> created in the correct direction. >> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > > >>if (help && rcu_access_pointer(help->helper) != info->helper) >>return false; >>} >> + /* Force conntrack entry direction to the current packet? */ >> + if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) { >> + /* Delete the conntrack entry if confirmed, else just release >> +* the reference. >> +*/ >> + if (nf_ct_is_confirmed(ct)) >> + nf_ct_delete(ct, 0, 0); >> + else >> + nf_conntrack_put(>ct_general); >> + skb->nfct = NULL; >> + skb->nfctinfo = 0; > > This can use nf_ct_set(). This is in v3. Jarno
[PATCH v4 net-next 05/10] openvswitch: Simplify labels length logic.
Since 23014011ba42 ("netfilter: conntrack: support a fixed size of 128 distinct labels"), the size of conntrack labels extension has fixed to 128 bits, so we do not need to check for labels sizes shorter than 128 at run-time. This patch simplifies labels length logic accordingly, but allows the conntrack labels size to be increased in the future without breaking the build. In the event of conntrack labels increasing in size OVS would still be able to deal with the 128 first label bits. Suggested-by: Joe Stringer <j...@ovn.org> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- net/openvswitch/conntrack.c | 20 +--- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index f23934c..fe2a410 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -129,22 +129,20 @@ static u32 ovs_ct_get_mark(const struct nf_conn *ct) #endif } +/* Guard against conntrack labels max size shrinking below 128 bits. */ +#if NF_CT_LABELS_MAX_SIZE < 16 +#error NF_CT_LABELS_MAX_SIZE must be at least 16 bytes +#endif + static void ovs_ct_get_labels(const struct nf_conn *ct, struct ovs_key_ct_labels *labels) { struct nf_conn_labels *cl = ct ? nf_ct_labels_find(ct) : NULL; - if (cl) { - size_t len = sizeof(cl->bits); - - if (len > OVS_CT_LABELS_LEN) - len = OVS_CT_LABELS_LEN; - else if (len < OVS_CT_LABELS_LEN) - memset(labels, 0, OVS_CT_LABELS_LEN); - memcpy(labels, cl->bits, len); - } else { + if (cl) + memcpy(labels, cl->bits, OVS_CT_LABELS_LEN); + else memset(labels, 0, OVS_CT_LABELS_LEN); - } } static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, @@ -274,7 +272,7 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, nf_ct_labels_ext_add(ct); cl = nf_ct_labels_find(ct); } - if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) + if (!cl) return -ENOSPC; if (nf_ct_is_confirmed(ct)) { -- 2.1.4
[PATCH v4 net-next 01/10] openvswitch: Fix comments for skb->_nfct
Fix comments referring to skb 'nfct' and 'nfctinfo' fields now that they are combined into '_nfct'. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- net/openvswitch/conntrack.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index fbffe0e..5de6d12 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -157,7 +157,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, ovs_ct_get_labels(ct, >ct.labels); } -/* Update 'key' based on skb->nfct. If 'post_ct' is true, then OVS has +/* Update 'key' based on skb->_nfct. If 'post_ct' is true, then OVS has * previously sent the packet to conntrack via the ct action. If * 'keep_nat_flags' is true, the existing NAT flags retained, else they are * initialized from the connection status. @@ -421,12 +421,12 @@ ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h) /* Find an existing connection which this packet belongs to without * re-attributing statistics or modifying the connection state. This allows an - * skb->nfct lost due to an upcall to be recovered during actions execution. + * skb->_nfct lost due to an upcall to be recovered during actions execution. * * Must be called with rcu_read_lock. * - * On success, populates skb->nfct and skb->nfctinfo, and returns the - * connection. Returns NULL if there is no existing entry. + * On success, populates skb->_nfct and returns the connection. Returns NULL + * if there is no existing entry. */ static struct nf_conn * ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, @@ -464,7 +464,7 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, return ct; } -/* Determine whether skb->nfct is equal to the result of conntrack lookup. */ +/* Determine whether skb->_nfct is equal to the result of conntrack lookup. */ static bool skb_nfct_cached(struct net *net, const struct sw_flow_key *key, const struct ovs_conntrack_info *info, @@ -475,7 +475,7 @@ static bool skb_nfct_cached(struct net *net, ct = nf_ct_get(skb, ); /* If no ct, check if we have evidence that an existing conntrack entry -* might be found for this skb. This happens when we lose a skb->nfct +* might be found for this skb. This happens when we lose a skb->_nfct * due to an upcall. If the connection was not confirmed, it is not * cached and needs to be run through conntrack again. */ @@ -699,7 +699,7 @@ static int ovs_ct_nat(struct net *net, struct sw_flow_key *key, /* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if * not done already. Update key with new CT state after passing the packet * through conntrack. - * Note that if the packet is deemed invalid by conntrack, skb->nfct will be + * Note that if the packet is deemed invalid by conntrack, skb->_nfct will be * set to NULL and 0 will be returned. */ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, -- 2.1.4
[PATCH v4 net-next 04/10] openvswitch: Unionize ovs_key_ct_label with a u32 array.
Make the array of labels in struct ovs_key_ct_label an union, adding a u32 array of the same byte size as the existing u8 array. It is faster to loop through the labels 32 bits at the time, which is also the alignment of netlink attributes. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Joe Stringer <j...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- include/uapi/linux/openvswitch.h | 8 ++-- net/openvswitch/conntrack.c | 15 --- 2 files changed, 14 insertions(+), 9 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 375d812..96aee34 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -446,9 +446,13 @@ struct ovs_key_nd { __u8nd_tll[ETH_ALEN]; }; -#define OVS_CT_LABELS_LEN 16 +#define OVS_CT_LABELS_LEN_32 4 +#define OVS_CT_LABELS_LEN (OVS_CT_LABELS_LEN_32 * sizeof(__u32)) struct ovs_key_ct_labels { - __u8ct_labels[OVS_CT_LABELS_LEN]; + union { + __u8ct_labels[OVS_CT_LABELS_LEN]; + __u32 ct_labels_32[OVS_CT_LABELS_LEN_32]; + }; }; /* OVS_KEY_ATTR_CT_STATE flags */ diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index a6ff374..f23934c 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -281,20 +281,21 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, /* Triggers a change event, which makes sense only for * confirmed connections. */ - int err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask, - OVS_CT_LABELS_LEN / sizeof(u32)); + int err = nf_connlabels_replace(ct, labels->ct_labels_32, + mask->ct_labels_32, + OVS_CT_LABELS_LEN_32); if (err) return err; } else { u32 *dst = (u32 *)cl->bits; - const u32 *msk = (const u32 *)mask->ct_labels; - const u32 *lbl = (const u32 *)labels->ct_labels; + const u32 *msk = mask->ct_labels_32; + const u32 *lbl = labels->ct_labels_32; int i; /* No-one else has access to the non-confirmed entry, copy * labels over, keeping any bits we are not explicitly setting. */ - for (i = 0; i < OVS_CT_LABELS_LEN / sizeof(u32); i++) + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) dst[i] = (dst[i] & ~msk[i]) | (lbl[i] & msk[i]); } @@ -866,8 +867,8 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels) { size_t i; - for (i = 0; i < sizeof(*labels); i++) - if (labels->ct_labels[i]) + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + if (labels->ct_labels_32[i]) return true; return false; -- 2.1.4
[PATCH v4 net-next 08/10] openvswitch: Add original direction conntrack tuple to sw_flow_key.
Add the fields of the conntrack original direction 5-tuple to struct sw_flow_key. The new fields are initially marked as non-existent, and are populated whenever a conntrack action is executed and either finds or generates a conntrack entry. This means that these fields exist for all packets that were not rejected by conntrack as untrackable. The original tuple fields in the sw_flow_key are filled from the original direction tuple of the conntrack entry relating to the current packet, or from the original direction tuple of the master conntrack entry, if the current conntrack entry has a master. Generally, expected connections of connections having an assigned helper (e.g., FTP), have a master conntrack entry. The main purpose of the new conntrack original tuple fields is to allow matching on them for policy decision purposes, with the premise that the admissibility of tracked connections reply packets (as well as original direction packets), and both direction packets of any related connections may be based on ACL rules applying to the master connection's original direction 5-tuple. This also makes it easier to make policy decisions when the actual packet headers might have been transformed by NAT, as the original direction 5-tuple represents the packet headers before any such transformation. When using the original direction 5-tuple the admissibility of return and/or related packets need not be based on the mere existence of a conntrack entry, allowing separation of admission policy from the established conntrack state. While existence of a conntrack entry is required for admission of the return or related packets, policy changes can render connections that were initially admitted to be rejected or dropped afterwards. If the admission of the return and related packets was based on mere conntrack state (e.g., connection being in an established state), a policy change that would make the connection rejected or dropped would need to find and delete all conntrack entries affected by such a change. When using the original direction 5-tuple matching the affected conntrack entries can be allowed to time out instead, as the established state of the connection would not need to be the basis for packet admission any more. It should be noted that the directionality of related connections may be the same or different than that of the master connection, and neither the original direction 5-tuple nor the conntrack state bits carry this information. If needed, the directionality of the master connection can be stored in master's conntrack mark or labels, which are automatically inherited by the expected related connections. The fact that neither ARP nor ND packets are trackable by conntrack allows mutual exclusion between ARP/ND and the new conntrack original tuple fields. Hence, the IP addresses are overlaid in union with ARP and ND fields. This allows the sw_flow_key to not grow much due to this patch, but it also means that we must be careful to never use the new key fields with ARP or ND packets. ARP is easy to distinguish and keep mutually exclusive based on the ethernet type, but ND being an ICMPv6 protocol requires a bit more attention. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Joe Stringer <j...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- include/uapi/linux/openvswitch.h | 20 +- net/openvswitch/actions.c| 2 + net/openvswitch/conntrack.c | 86 +--- net/openvswitch/conntrack.h | 10 - net/openvswitch/flow.c | 34 +--- net/openvswitch/flow.h | 49 ++- net/openvswitch/flow_netlink.c | 85 +-- net/openvswitch/flow_netlink.h | 7 +++- 8 files changed, 246 insertions(+), 47 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 96aee34..90af8b8 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -1,6 +1,6 @@ /* - * Copyright (c) 2007-2013 Nicira, Inc. + * Copyright (c) 2007-2017 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -331,6 +331,8 @@ enum ovs_key_attr { OVS_KEY_ATTR_CT_ZONE, /* u16 connection tracking zone. */ OVS_KEY_ATTR_CT_MARK, /* u32 connection tracking mark */ OVS_KEY_ATTR_CT_LABELS, /* 16-octet connection tracking label */ + OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, /* struct ovs_key_ct_tuple_ipv4 */ + OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, /* struct ovs_key_ct_tuple_ipv6 */ #ifdef __KERNEL__ OVS_KEY_ATTR_TUNNEL_INFO, /* struct ip_tunnel_info */ @@ -472,6 +474,22 @@ struct ovs_key_ct_labels { #define OVS_CS_F_NAT_MASK (OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT) +struct ovs_key_ct_tuple_ipv4 { + __be32 ipv4_src; + __be32 ipv4_dst; +
[PATCH v4 net-next 10/10] openvswitch: Pack struct sw_flow_key.
struct sw_flow_key has two 16-bit holes. Move the most matched conntrack match fields there. In some typical cases this reduces the size of the key that needs to be hashed into half and into one cache line. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Joe Stringer <j...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- net/openvswitch/conntrack.c| 40 net/openvswitch/conntrack.h| 8 net/openvswitch/flow.h | 14 -- net/openvswitch/flow_netlink.c | 11 +++ 4 files changed, 39 insertions(+), 34 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 8b15bab..c2d452e 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -152,7 +152,7 @@ static void __ovs_ct_update_key_orig_tp(struct sw_flow_key *key, const struct nf_conntrack_tuple *orig, u8 icmp_proto) { - key->ct.orig_proto = orig->dst.protonum; + key->ct_orig_proto = orig->dst.protonum; if (orig->dst.protonum == icmp_proto) { key->ct.orig_tp.src = htons(orig->dst.u.icmp.type); key->ct.orig_tp.dst = htons(orig->dst.u.icmp.code); @@ -166,8 +166,8 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, const struct nf_conntrack_zone *zone, const struct nf_conn *ct) { - key->ct.state = state; - key->ct.zone = zone->id; + key->ct_state = state; + key->ct_zone = zone->id; key->ct.mark = ovs_ct_get_mark(ct); ovs_ct_get_labels(ct, >ct.labels); @@ -195,10 +195,10 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, return; } } - /* Clear 'ct.orig_proto' to mark the non-existence of conntrack + /* Clear 'ct_orig_proto' to mark the non-existence of conntrack * original direction key fields. */ - key->ct.orig_proto = 0; + key->ct_orig_proto = 0; } /* Update 'key' based on skb->_nfct. If 'post_ct' is true, then OVS has @@ -228,7 +228,7 @@ static void ovs_ct_update_key(const struct sk_buff *skb, if (ct->master) state |= OVS_CS_F_RELATED; if (keep_nat_flags) { - state |= key->ct.state & OVS_CS_F_NAT_MASK; + state |= key->ct_state & OVS_CS_F_NAT_MASK; } else { if (ct->status & IPS_SRC_NAT) state |= OVS_CS_F_SRC_NAT; @@ -259,11 +259,11 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key) int ovs_ct_put_key(const struct sw_flow_key *swkey, const struct sw_flow_key *output, struct sk_buff *skb) { - if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct.state)) + if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct_state)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) && - nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct.zone)) + nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct_zone)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) && @@ -275,14 +275,14 @@ int ovs_ct_put_key(const struct sw_flow_key *swkey, >ct.labels)) return -EMSGSIZE; - if (swkey->ct.orig_proto) { + if (swkey->ct_orig_proto) { if (swkey->eth.type == htons(ETH_P_IP)) { struct ovs_key_ct_tuple_ipv4 orig = { output->ipv4.ct_orig.src, output->ipv4.ct_orig.dst, output->ct.orig_tp.src, output->ct.orig_tp.dst, - output->ct.orig_proto, + output->ct_orig_proto, }; if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, sizeof(orig), )) @@ -293,7 +293,7 @@ int ovs_ct_put_key(const struct sw_flow_key *swkey, IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.dst), output->ct.orig_tp.src, output->ct.orig_tp.dst, - output->ct.orig_proto, + output->ct_orig_proto, }; if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, sizeof(orig), )) @@ -612,11 +612,11 @@ static bool skb_nfct_cached(struct net *net,
[PATCH v4 net-next 03/10] openvswitch: Do not trigger events for unconfirmed connections.
Receiving change events before the 'new' event for the connection has been received can be confusing. Avoid triggering change events for setting conntrack mark or labels before the conntrack entry has been confirmed. Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark") Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Joe Stringer <j...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- net/openvswitch/conntrack.c | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 4df9a54..a6ff374 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -245,7 +245,8 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, new_mark = ct_mark | (ct->mark & ~(mask)); if (ct->mark != new_mark) { ct->mark = new_mark; - nf_conntrack_event_cache(IPCT_MARK, ct); + if (nf_ct_is_confirmed(ct)) + nf_conntrack_event_cache(IPCT_MARK, ct); key->ct.mark = new_mark; } @@ -262,7 +263,6 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, enum ip_conntrack_info ctinfo; struct nf_conn_labels *cl; struct nf_conn *ct; - int err; /* The connection could be invalid, in which case set_label is no-op.*/ ct = nf_ct_get(skb, ); @@ -277,10 +277,26 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) return -ENOSPC; - err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask, - OVS_CT_LABELS_LEN / sizeof(u32)); - if (err) - return err; + if (nf_ct_is_confirmed(ct)) { + /* Triggers a change event, which makes sense only for +* confirmed connections. +*/ + int err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask, + OVS_CT_LABELS_LEN / sizeof(u32)); + if (err) + return err; + } else { + u32 *dst = (u32 *)cl->bits; + const u32 *msk = (const u32 *)mask->ct_labels; + const u32 *lbl = (const u32 *)labels->ct_labels; + int i; + + /* No-one else has access to the non-confirmed entry, copy +* labels over, keeping any bits we are not explicitly setting. +*/ + for (i = 0; i < OVS_CT_LABELS_LEN / sizeof(u32); i++) + dst[i] = (dst[i] & ~msk[i]) | (lbl[i] & msk[i]); + } ovs_ct_get_labels(ct, >ct.labels); return 0; -- 2.1.4
[PATCH v4 net-next 00/10] openvswitch: Conntrack integration improvements.
This series improves the conntrack integration code in the openvswitch module by fixing outdated comments (patch 1), bugs (patches 2, 3, and 7), clarifying code (patches 4, 5, and 6), improving performance (patch 10), and adding new features enabling better translation from firewall admission policy to network configuration requested by user communities (patches 8 and 9). Please note that v3 of the series was Acked by Pravin, but I posted a v4 addressing the remaining english language and coding style issues posted by Joe on v2. v4: Address remaining language and coding style issues from Joe. v3: Rebase to the current net-next, add the comment only changing patch 1 and reshuffle some of the patches as requested by Joe. Jarno Rajahalme (10): openvswitch: Fix comments for skb->_nfct openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted. openvswitch: Do not trigger events for unconfirmed connections. openvswitch: Unionize ovs_key_ct_label with a u32 array. openvswitch: Simplify labels length logic. openvswitch: Refactor labels initialization. openvswitch: Inherit master's labels. openvswitch: Add original direction conntrack tuple to sw_flow_key. openvswitch: Add force commit. openvswitch: Pack struct sw_flow_key. include/uapi/linux/openvswitch.h | 33 - net/openvswitch/actions.c| 2 + net/openvswitch/conntrack.c | 296 ++- net/openvswitch/conntrack.h | 14 +- net/openvswitch/flow.c | 34 - net/openvswitch/flow.h | 55 ++-- net/openvswitch/flow_netlink.c | 92 +--- net/openvswitch/flow_netlink.h | 7 +- 8 files changed, 420 insertions(+), 113 deletions(-) -- 2.1.4
[PATCH v4 net-next 06/10] openvswitch: Refactor labels initialization.
Refactoring conntrack labels initialization makes changes in later patches easier to review. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- net/openvswitch/conntrack.c | 104 ++-- 1 file changed, 62 insertions(+), 42 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index fe2a410..7c5bb98 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -227,19 +227,12 @@ int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb) return 0; } -static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, +static int ovs_ct_set_mark(struct nf_conn *ct, struct sw_flow_key *key, u32 ct_mark, u32 mask) { #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) - enum ip_conntrack_info ctinfo; - struct nf_conn *ct; u32 new_mark; - /* The connection could be invalid, in which case set_mark is no-op. */ - ct = nf_ct_get(skb, ); - if (!ct) - return 0; - new_mark = ct_mark | (ct->mark & ~(mask)); if (ct->mark != new_mark) { ct->mark = new_mark; @@ -254,50 +247,66 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, #endif } -static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, -const struct ovs_key_ct_labels *labels, -const struct ovs_key_ct_labels *mask) +static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct) { - enum ip_conntrack_info ctinfo; struct nf_conn_labels *cl; - struct nf_conn *ct; - - /* The connection could be invalid, in which case set_label is no-op.*/ - ct = nf_ct_get(skb, ); - if (!ct) - return 0; cl = nf_ct_labels_find(ct); if (!cl) { nf_ct_labels_ext_add(ct); cl = nf_ct_labels_find(ct); } + + return cl; +} + +/* Initialize labels for a new, yet to be committed conntrack entry. Note that + * since the new connection is not yet confirmed, and thus no-one else has + * access to it's labels, we simply write them over. Also, we refrain from + * triggering events, as receiving change events before the create event would + * be confusing. + */ +static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, + const struct ovs_key_ct_labels *labels, + const struct ovs_key_ct_labels *mask) +{ + struct nf_conn_labels *cl; + u32 *dst; + int i; + + cl = ovs_ct_get_conn_labels(ct); if (!cl) return -ENOSPC; - if (nf_ct_is_confirmed(ct)) { - /* Triggers a change event, which makes sense only for -* confirmed connections. -*/ - int err = nf_connlabels_replace(ct, labels->ct_labels_32, - mask->ct_labels_32, - OVS_CT_LABELS_LEN_32); - if (err) - return err; - } else { - u32 *dst = (u32 *)cl->bits; - const u32 *msk = mask->ct_labels_32; - const u32 *lbl = labels->ct_labels_32; - int i; + dst = (u32 *)cl->bits; + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] & mask->ct_labels_32[i]); - /* No-one else has access to the non-confirmed entry, copy -* labels over, keeping any bits we are not explicitly setting. -*/ - for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) - dst[i] = (dst[i] & ~msk[i]) | (lbl[i] & msk[i]); - } + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); + + return 0; +} + +static int ovs_ct_set_labels(struct nf_conn *ct, struct sw_flow_key *key, +const struct ovs_key_ct_labels *labels, +const struct ovs_key_ct_labels *mask) +{ + struct nf_conn_labels *cl; + int err; + + cl = ovs_ct_get_conn_labels(ct); + if (!cl) + return -ENOSPC; + + err = nf_connlabels_replace(ct, labels->ct_labels_32, + mask->ct_labels_32, + OVS_CT_LABELS_LEN_32); + if (err) + return err; + + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); - ovs_ct_get_labels(ct, >ct.labels); return 0; } @@ -877,25 +886,36 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, const struct ovs_conntrack_info *info,
[PATCH v4 net-next 07/10] openvswitch: Inherit master's labels.
We avoid calling into nf_conntrack_in() for expected connections, as that would remove the expectation that we want to stick around until we are ready to commit the connection. Instead, we do a lookup in the expectation table directly. However, after a successful expectation lookup we have set the flow key label field from the master connection, whereas nf_conntrack_in() does not do this. This leads to master's labels being inherited after an expectation lookup, but those labels not being inherited after the corresponding conntrack action with a commit flag. This patch resolves the problem by changing the commit code path to also inherit the master's labels to the expected connection. Resolving this conflict in favor of inheriting the labels allows more information be passed from the master connection to related connections, which would otherwise be much harder if the 32 bits in the connmark are not enough. Labels can still be set explicitly, so this change only affects the default values of the labels in presense of a master connection. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- net/openvswitch/conntrack.c | 45 +++-- 1 file changed, 31 insertions(+), 14 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 7c5bb98..f989ccf 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -73,6 +73,8 @@ struct ovs_conntrack_info { #endif }; +static bool labels_nonzero(const struct ovs_key_ct_labels *labels); + static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info); static u16 key_to_nfproto(const struct sw_flow_key *key) @@ -270,18 +272,32 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, const struct ovs_key_ct_labels *labels, const struct ovs_key_ct_labels *mask) { - struct nf_conn_labels *cl; - u32 *dst; - int i; + struct nf_conn_labels *cl, *master_cl; + bool have_mask = labels_nonzero(mask); + + /* Inherit master's labels to the related connection? */ + master_cl = ct->master ? nf_ct_labels_find(ct->master) : NULL; + + if (!master_cl && !have_mask) + return 0; /* Nothing to do. */ cl = ovs_ct_get_conn_labels(ct); if (!cl) return -ENOSPC; - dst = (u32 *)cl->bits; - for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) - dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | - (labels->ct_labels_32[i] & mask->ct_labels_32[i]); + /* Inherit the master's labels, if any. */ + if (master_cl) + *cl = *master_cl; + + if (have_mask) { + u32 *dst = (u32 *)cl->bits; + int i; + + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] +& mask->ct_labels_32[i]); + } memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); @@ -909,13 +925,14 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, if (err) return err; } - if (labels_nonzero(>labels.mask)) { - if (!nf_ct_is_confirmed(ct)) - err = ovs_ct_init_labels(ct, key, >labels.value, ->labels.mask); - else - err = ovs_ct_set_labels(ct, key, >labels.value, - >labels.mask); + if (!nf_ct_is_confirmed(ct)) { + err = ovs_ct_init_labels(ct, key, >labels.value, +>labels.mask); + if (err) + return err; + } else if (labels_nonzero(>labels.mask)) { + err = ovs_ct_set_labels(ct, key, >labels.value, + >labels.mask); if (err) return err; } -- 2.1.4
[PATCH v4 net-next 09/10] openvswitch: Add force commit.
Stateful network admission policy may allow connections to one direction and reject connections initiated in the other direction. After policy change it is possible that for a new connection an overlapping conntrack entry already exists, where the original direction of the existing connection is opposed to the new connection's initial packet. Most importantly, conntrack state relating to the current packet gets the "reply" designation based on whether the original direction tuple or the reply direction tuple matched. If this "directionality" is wrong w.r.t. to the stateful network admission policy it may happen that packets in neither direction are correctly admitted. This patch adds a new "force commit" option to the OVS conntrack action that checks the original direction of an existing conntrack entry. If that direction is opposed to the current packet, the existing conntrack entry is deleted and a new one is subsequently created in the correct direction. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- include/uapi/linux/openvswitch.h | 5 + net/openvswitch/conntrack.c | 26 -- 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 90af8b8..7f41f7d 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -674,6 +674,10 @@ struct ovs_action_hash { * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG. * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address * translation (NAT) on the packet. + * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of doing + * nothing if the connection is already committed will check that the current + * packet is in conntrack entry's original direction. If directionality does + * not match, will delete the existing conntrack entry and commit a new one. */ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -684,6 +688,7 @@ enum ovs_ct_attr { OVS_CT_ATTR_HELPER, /* netlink helper to assist detection of related connections. */ OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ + OVS_CT_ATTR_FORCE_COMMIT, /* No argument */ __OVS_CT_ATTR_MAX }; diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index bfd7606..8b15bab 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -65,6 +65,7 @@ struct ovs_conntrack_info { struct nf_conn *ct; u8 commit : 1; u8 nat : 3; /* enum ovs_ct_nat */ + u8 force : 1; u16 family; struct md_mark mark; struct md_labels labels; @@ -613,10 +614,13 @@ static bool skb_nfct_cached(struct net *net, */ if (!ct && key->ct.state & OVS_CS_F_TRACKED && !(key->ct.state & OVS_CS_F_INVALID) && - key->ct.zone == info->zone.id) + key->ct.zone == info->zone.id) { ct = ovs_ct_find_existing(net, >zone, info->family, skb, !!(key->ct.state & OVS_CS_F_NAT_MASK)); + if (ct) + nf_ct_get(skb, ); + } if (!ct) return false; if (!net_eq(net, read_pnet(>ct_net))) @@ -630,6 +634,18 @@ static bool skb_nfct_cached(struct net *net, if (help && rcu_access_pointer(help->helper) != info->helper) return false; } + /* Force conntrack entry direction to the current packet? */ + if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) { + /* Delete the conntrack entry if confirmed, else just release +* the reference. +*/ + if (nf_ct_is_confirmed(ct)) + nf_ct_delete(ct, 0, 0); + else + nf_conntrack_put(>ct_general); + nf_ct_set(skb, NULL, 0); + return false; + } return true; } @@ -1207,6 +1223,7 @@ static int parse_nat(const struct nlattr *attr, static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = { [OVS_CT_ATTR_COMMIT]= { .minlen = 0, .maxlen = 0 }, + [OVS_CT_ATTR_FORCE_COMMIT] = { .minlen = 0, .maxlen = 0 }, [OVS_CT_ATTR_ZONE] = { .minlen = sizeof(u16), .maxlen = sizeof(u16) }, [OVS_CT_ATTR_MARK] = { .minlen = sizeof(struct md_mark), @@ -1246,6 +1263,9 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, } switch (type) { + case OVS_CT_ATTR_FORCE_COMMIT: + info-
[PATCH v4 net-next 02/10] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
The conntrack lookup for existing connections fails to invert the packet 5-tuple for NATted packets, and therefore fails to find the existing conntrack entry. Conntrack only stores 5-tuples for incoming packets, and there are various situations where a lookup on a packet that has already been transformed by NAT needs to be made. Looking up an existing conntrack entry upon executing packet received from the userspace is one of them. This patch fixes ovs_ct_find_existing() to invert the packet 5-tuple for the conntrack lookup whenever the packet has already been transformed by conntrack from its input form as evidenced by one of the NAT flags being set in the conntrack state metadata. Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Joe Stringer <j...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> --- net/openvswitch/conntrack.c | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 5de6d12..4df9a54 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -430,7 +430,7 @@ ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h) */ static struct nf_conn * ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, -u8 l3num, struct sk_buff *skb) +u8 l3num, struct sk_buff *skb, bool natted) { struct nf_conntrack_l3proto *l3proto; struct nf_conntrack_l4proto *l4proto; @@ -453,6 +453,17 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, return NULL; } + /* Must invert the tuple if skb has been transformed by NAT. */ + if (natted) { + struct nf_conntrack_tuple inverse; + + if (!nf_ct_invert_tuple(, , l3proto, l4proto)) { + pr_debug("ovs_ct_find_existing: Inversion failed!\n"); + return NULL; + } + tuple = inverse; + } + /* look for tuple match */ h = nf_conntrack_find_get(net, zone, ); if (!h) @@ -460,6 +471,13 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, ct = nf_ct_tuplehash_to_ctrack(h); + /* Inverted packet tuple matches the reverse direction conntrack tuple, +* select the other tuplehash to get the right 'ctinfo' bits for this +* packet. +*/ + if (natted) + h = >tuplehash[!h->tuple.dst.dir]; + nf_ct_set(skb, ct, ovs_ct_get_info(h)); return ct; } @@ -482,7 +500,9 @@ static bool skb_nfct_cached(struct net *net, if (!ct && key->ct.state & OVS_CS_F_TRACKED && !(key->ct.state & OVS_CS_F_INVALID) && key->ct.zone == info->zone.id) - ct = ovs_ct_find_existing(net, >zone, info->family, skb); + ct = ovs_ct_find_existing(net, >zone, info->family, skb, + !!(key->ct.state +& OVS_CS_F_NAT_MASK)); if (!ct) return false; if (!net_eq(net, read_pnet(>ct_net))) -- 2.1.4
Re: [PATCH v2 net-next 5/9] openvswitch: Refactor labels initialization.
> On Feb 8, 2017, at 3:06 PM, Joe Stringer <j...@ovn.org> wrote: > > On 8 February 2017 at 11:32, Jarno Rajahalme <ja...@ovn.org> wrote: >> Refactoring conntrack labels initialization makes chenges in later > > *changes > >> patches easier to review. >> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > Minor other nit: > >> >>cl = nf_ct_labels_find(ct); >>if (!cl) { >>nf_ct_labels_ext_add(ct); >>cl = nf_ct_labels_find(ct); >>} >> + >> + return cl; >> +} >> + >> +/* Initialize labels for a new, to be committed conntrack entry. Note that > > Maybe insert 'yet': > > Initialize labels for a new, yet to be committed conntrack entry. Note that Fixed these typos for v4. Jarno
Re: [PATCH v2 net-next 6/9] openvswitch: Inherit master's labels.
> On Feb 8, 2017, at 3:25 PM, Joe Stringer <j...@ovn.org> wrote: > > On 8 February 2017 at 11:32, Jarno Rajahalme <ja...@ovn.org> wrote: >> We avoid calling into nf_conntrack_in() for expected connections, as >> that would remove the expectation that we want to stick around until >> we are ready to commit the connection. Instead, we do a lookup in the >> expectation table directly. However, after a successful expectation >> lookup we have set the flow key label field from the master >> connection, whereas nf_conntrack_in() does not do this. This leads to >> master's labels being inherited after an expectation lookup, but those >> labels not being inherited after the corresponding conntrack action >> with a commit flag. >> >> This patch resolves the problem by changing the commit code path to >> also inherit the master's labels to the expected connection. >> Resolving this conflict in favor or inheriting the labels allows more > > *of inheriting > >> information be passed from the master connection to related >> connections, which would otherwise be much harder if the 32 bits in >> the connmark are not enough. Labels can still be set explicitly, so >> this change only affects the default values of the labels in presense >> of a master connection. >> >> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > > >> @@ -272,18 +274,32 @@ static int ovs_ct_init_labels(struct nf_conn *ct, >> struct sw_flow_key *key, >> const struct ovs_key_ct_labels *labels, >> const struct ovs_key_ct_labels *mask) >> { >> - struct nf_conn_labels *cl; >> - u32 *dst; >> - int i; >> + struct nf_conn_labels *cl, *master_cl; >> + bool have_mask = labels_nonzero(mask); >> + >> + /* Inherit master's labels to the related connection? */ >> + master_cl = (ct->master) ? nf_ct_labels_find(ct->master) : NULL; > > I don't think (ct->master) needs the parentheses. Fixed these style issues for v4. Jarno
[PATCH v3 net-next 10/10] openvswitch: Pack struct sw_flow_key.
struct sw_flow_key has two 16-bit holes. Move the most matched conntrack match fields there. In some typical cases this reduces the size of the key that needs to be hashed into half and into one cache line. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c| 40 net/openvswitch/conntrack.h| 8 net/openvswitch/flow.h | 14 -- net/openvswitch/flow_netlink.c | 11 +++ 4 files changed, 39 insertions(+), 34 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index de47782..a12825e 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -154,7 +154,7 @@ static void __ovs_ct_update_key_orig_tp(struct sw_flow_key *key, const struct nf_conntrack_tuple *orig, u8 icmp_proto) { - key->ct.orig_proto = orig->dst.protonum; + key->ct_orig_proto = orig->dst.protonum; if (orig->dst.protonum == icmp_proto) { key->ct.orig_tp.src = htons(orig->dst.u.icmp.type); key->ct.orig_tp.dst = htons(orig->dst.u.icmp.code); @@ -168,8 +168,8 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, const struct nf_conntrack_zone *zone, const struct nf_conn *ct) { - key->ct.state = state; - key->ct.zone = zone->id; + key->ct_state = state; + key->ct_zone = zone->id; key->ct.mark = ovs_ct_get_mark(ct); ovs_ct_get_labels(ct, >ct.labels); @@ -197,10 +197,10 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, return; } } - /* Clear 'ct.orig_proto' to mark the non-existence of conntrack + /* Clear 'ct_orig_proto' to mark the non-existence of conntrack * original direction key fields. */ - key->ct.orig_proto = 0; + key->ct_orig_proto = 0; } /* Update 'key' based on skb->_nfct. If 'post_ct' is true, then OVS has @@ -230,7 +230,7 @@ static void ovs_ct_update_key(const struct sk_buff *skb, if (ct->master) state |= OVS_CS_F_RELATED; if (keep_nat_flags) { - state |= key->ct.state & OVS_CS_F_NAT_MASK; + state |= key->ct_state & OVS_CS_F_NAT_MASK; } else { if (ct->status & IPS_SRC_NAT) state |= OVS_CS_F_SRC_NAT; @@ -261,11 +261,11 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key) int ovs_ct_put_key(const struct sw_flow_key *swkey, const struct sw_flow_key *output, struct sk_buff *skb) { - if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct.state)) + if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct_state)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) && - nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct.zone)) + nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct_zone)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) && @@ -277,14 +277,14 @@ int ovs_ct_put_key(const struct sw_flow_key *swkey, >ct.labels)) return -EMSGSIZE; - if (swkey->ct.orig_proto) { + if (swkey->ct_orig_proto) { if (swkey->eth.type == htons(ETH_P_IP)) { struct ovs_key_ct_tuple_ipv4 orig = { output->ipv4.ct_orig.src, output->ipv4.ct_orig.dst, output->ct.orig_tp.src, output->ct.orig_tp.dst, - output->ct.orig_proto, + output->ct_orig_proto, }; if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, sizeof(orig), )) @@ -295,7 +295,7 @@ int ovs_ct_put_key(const struct sw_flow_key *swkey, IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.dst), output->ct.orig_tp.src, output->ct.orig_tp.dst, - output->ct.orig_proto, + output->ct_orig_proto, }; if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, sizeof(orig), )) @@ -614,11 +614,11 @@ static bool skb_nfct_cached(struct net *net, * due to an upcall. If the connection was not confirmed, it is not * cached and needs to be run thro
[PATCH v3 net-next 04/10] openvswitch: Unionize ovs_key_ct_label with a u32 array.
Make the array of labels in struct ovs_key_ct_label an union, adding a u32 array of the same byte size as the existing u8 array. It is faster to loop through the labels 32 bits at the time, which is also the alignment of netlink attributes. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 8 ++-- net/openvswitch/conntrack.c | 15 --- 2 files changed, 14 insertions(+), 9 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 375d812..96aee34 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -446,9 +446,13 @@ struct ovs_key_nd { __u8nd_tll[ETH_ALEN]; }; -#define OVS_CT_LABELS_LEN 16 +#define OVS_CT_LABELS_LEN_32 4 +#define OVS_CT_LABELS_LEN (OVS_CT_LABELS_LEN_32 * sizeof(__u32)) struct ovs_key_ct_labels { - __u8ct_labels[OVS_CT_LABELS_LEN]; + union { + __u8ct_labels[OVS_CT_LABELS_LEN]; + __u32 ct_labels_32[OVS_CT_LABELS_LEN_32]; + }; }; /* OVS_KEY_ATTR_CT_STATE flags */ diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index a6ff374..f23934c 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -281,20 +281,21 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, /* Triggers a change event, which makes sense only for * confirmed connections. */ - int err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask, - OVS_CT_LABELS_LEN / sizeof(u32)); + int err = nf_connlabels_replace(ct, labels->ct_labels_32, + mask->ct_labels_32, + OVS_CT_LABELS_LEN_32); if (err) return err; } else { u32 *dst = (u32 *)cl->bits; - const u32 *msk = (const u32 *)mask->ct_labels; - const u32 *lbl = (const u32 *)labels->ct_labels; + const u32 *msk = mask->ct_labels_32; + const u32 *lbl = labels->ct_labels_32; int i; /* No-one else has access to the non-confirmed entry, copy * labels over, keeping any bits we are not explicitly setting. */ - for (i = 0; i < OVS_CT_LABELS_LEN / sizeof(u32); i++) + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) dst[i] = (dst[i] & ~msk[i]) | (lbl[i] & msk[i]); } @@ -866,8 +867,8 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels) { size_t i; - for (i = 0; i < sizeof(*labels); i++) - if (labels->ct_labels[i]) + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + if (labels->ct_labels_32[i]) return true; return false; -- 2.1.4
[PATCH v3 net-next 07/10] openvswitch: Inherit master's labels.
We avoid calling into nf_conntrack_in() for expected connections, as that would remove the expectation that we want to stick around until we are ready to commit the connection. Instead, we do a lookup in the expectation table directly. However, after a successful expectation lookup we have set the flow key label field from the master connection, whereas nf_conntrack_in() does not do this. This leads to master's labels being inherited after an expectation lookup, but those labels not being inherited after the corresponding conntrack action with a commit flag. This patch resolves the problem by changing the commit code path to also inherit the master's labels to the expected connection. Resolving this conflict in favor or inheriting the labels allows more information be passed from the master connection to related connections, which would otherwise be much harder if the 32 bits in the connmark are not enough. Labels can still be set explicitly, so this change only affects the default values of the labels in presense of a master connection. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 45 +++-- 1 file changed, 31 insertions(+), 14 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 0e038ee..5fbadcd 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -73,6 +73,8 @@ struct ovs_conntrack_info { #endif }; +static bool labels_nonzero(const struct ovs_key_ct_labels *labels); + static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info); static u16 key_to_nfproto(const struct sw_flow_key *key) @@ -272,18 +274,32 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, const struct ovs_key_ct_labels *labels, const struct ovs_key_ct_labels *mask) { - struct nf_conn_labels *cl; - u32 *dst; - int i; + struct nf_conn_labels *cl, *master_cl; + bool have_mask = labels_nonzero(mask); + + /* Inherit master's labels to the related connection? */ + master_cl = (ct->master) ? nf_ct_labels_find(ct->master) : NULL; + + if (!master_cl && !have_mask) + return 0; /* Nothing to do. */ cl = ovs_ct_get_conn_labels(ct); if (!cl) return -ENOSPC; - dst = (u32 *)cl->bits; - for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) - dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | - (labels->ct_labels_32[i] & mask->ct_labels_32[i]); + /* Inherit the master's labels, if any. */ + if (master_cl) + *cl = *master_cl; + + if (have_mask) { + u32 *dst = (u32 *)cl->bits; + int i; + + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] +& mask->ct_labels_32[i]); + } memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); @@ -911,13 +927,14 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, if (err) return err; } - if (labels_nonzero(>labels.mask)) { - if (!nf_ct_is_confirmed(ct)) - err = ovs_ct_init_labels(ct, key, >labels.value, ->labels.mask); - else - err = ovs_ct_set_labels(ct, key, >labels.value, - >labels.mask); + if (!nf_ct_is_confirmed(ct)) { + err = ovs_ct_init_labels(ct, key, >labels.value, +>labels.mask); + if (err) + return err; + } else if (labels_nonzero(>labels.mask)) { + err = ovs_ct_set_labels(ct, key, >labels.value, + >labels.mask); if (err) return err; } -- 2.1.4
[PATCH v3 net-next 05/10] openvswitch: Simplify labels length logic.
Since 23014011ba42 ("netfilter: conntrack: support a fixed size of 128 distinct labels"), the size of conntrack labels extension has fixed to 128 bits, so we do not need to check for labels sizes shorter than 128 at run-time. This patch simplifies labels length logic accordingly, but allows the conntrack labels size to be increased in the future without breaking the build. In the event of conntrack labels increasing in size OVS would still be able to deal with the 128 first label bits. Suggested-by: Joe Stringer <j...@ovn.org> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index f23934c..c7db4da 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -129,22 +129,22 @@ static u32 ovs_ct_get_mark(const struct nf_conn *ct) #endif } +/* Guard against conntrack labels max size shrinking below 128 bits. */ +#if NF_CT_LABELS_MAX_SIZE < 16 +#error NF_CT_LABELS_MAX_SIZE must be at least 16 bytes +#endif + static void ovs_ct_get_labels(const struct nf_conn *ct, struct ovs_key_ct_labels *labels) { struct nf_conn_labels *cl = ct ? nf_ct_labels_find(ct) : NULL; - if (cl) { - size_t len = sizeof(cl->bits); - - if (len > OVS_CT_LABELS_LEN) - len = OVS_CT_LABELS_LEN; - else if (len < OVS_CT_LABELS_LEN) - memset(labels, 0, OVS_CT_LABELS_LEN); - memcpy(labels, cl->bits, len); - } else { + if (cl) + memcpy(labels, cl->bits, + sizeof(cl->bits) > OVS_CT_LABELS_LEN + ? OVS_CT_LABELS_LEN : sizeof(cl->bits)); + else memset(labels, 0, OVS_CT_LABELS_LEN); - } } static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, @@ -274,7 +274,7 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, nf_ct_labels_ext_add(ct); cl = nf_ct_labels_find(ct); } - if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) + if (!cl) return -ENOSPC; if (nf_ct_is_confirmed(ct)) { -- 2.1.4
[PATCH v3 net-next 00/10] openvswitch: Conntrack integration improvements.
This series improves the conntrack integration code in the openvswitch module by fixing outdated comments (patch 1), bugs (patches 2, 3, and 7), clarifying code (patches 4, 5, and 6), improving performance (patch 10), and adding new features enabling better translation from firewall admission policy to network configuration requested by user communities (patches 8 and 9). v3: Rebase to the current net-next, add the comment only changing patch 1 and reshuffle some of the patches as requested by Joe. Jarno Rajahalme (10): openvswitch: Fix comments for skb->_nfct openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted. openvswitch: Do not trigger events for unconfirmed connections. openvswitch: Unionize ovs_key_ct_label with a u32 array. openvswitch: Simplify labels length logic. openvswitch: Refactor labels initialization. openvswitch: Inherit master's labels. openvswitch: Add original direction conntrack tuple to sw_flow_key. openvswitch: Add force commit. openvswitch: Pack struct sw_flow_key. include/uapi/linux/openvswitch.h | 33 - net/openvswitch/actions.c| 2 + net/openvswitch/conntrack.c | 298 ++- net/openvswitch/conntrack.h | 14 +- net/openvswitch/flow.c | 34 - net/openvswitch/flow.h | 55 ++-- net/openvswitch/flow_netlink.c | 92 +--- net/openvswitch/flow_netlink.h | 7 +- 8 files changed, 422 insertions(+), 113 deletions(-) -- 2.1.4
[PATCH v3 net-next 06/10] openvswitch: Refactor labels initialization.
Refactoring conntrack labels initialization makes chenges in later patches easier to review. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 104 ++-- 1 file changed, 62 insertions(+), 42 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index c7db4da..0e038ee 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -229,19 +229,12 @@ int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb) return 0; } -static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, +static int ovs_ct_set_mark(struct nf_conn *ct, struct sw_flow_key *key, u32 ct_mark, u32 mask) { #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) - enum ip_conntrack_info ctinfo; - struct nf_conn *ct; u32 new_mark; - /* The connection could be invalid, in which case set_mark is no-op. */ - ct = nf_ct_get(skb, ); - if (!ct) - return 0; - new_mark = ct_mark | (ct->mark & ~(mask)); if (ct->mark != new_mark) { ct->mark = new_mark; @@ -256,50 +249,66 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, #endif } -static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, -const struct ovs_key_ct_labels *labels, -const struct ovs_key_ct_labels *mask) +static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct) { - enum ip_conntrack_info ctinfo; struct nf_conn_labels *cl; - struct nf_conn *ct; - - /* The connection could be invalid, in which case set_label is no-op.*/ - ct = nf_ct_get(skb, ); - if (!ct) - return 0; cl = nf_ct_labels_find(ct); if (!cl) { nf_ct_labels_ext_add(ct); cl = nf_ct_labels_find(ct); } + + return cl; +} + +/* Initialize labels for a new, to be committed conntrack entry. Note that + * since the new connection is not yet confirmed, and thus no-one else has + * access to it's labels, we simply write them over. Also, we refrain from + * triggering events, as receiving change events before the create event would + * be confusing. + */ +static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, + const struct ovs_key_ct_labels *labels, + const struct ovs_key_ct_labels *mask) +{ + struct nf_conn_labels *cl; + u32 *dst; + int i; + + cl = ovs_ct_get_conn_labels(ct); if (!cl) return -ENOSPC; - if (nf_ct_is_confirmed(ct)) { - /* Triggers a change event, which makes sense only for -* confirmed connections. -*/ - int err = nf_connlabels_replace(ct, labels->ct_labels_32, - mask->ct_labels_32, - OVS_CT_LABELS_LEN_32); - if (err) - return err; - } else { - u32 *dst = (u32 *)cl->bits; - const u32 *msk = mask->ct_labels_32; - const u32 *lbl = labels->ct_labels_32; - int i; + dst = (u32 *)cl->bits; + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] & mask->ct_labels_32[i]); - /* No-one else has access to the non-confirmed entry, copy -* labels over, keeping any bits we are not explicitly setting. -*/ - for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) - dst[i] = (dst[i] & ~msk[i]) | (lbl[i] & msk[i]); - } + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); + + return 0; +} + +static int ovs_ct_set_labels(struct nf_conn *ct, struct sw_flow_key *key, +const struct ovs_key_ct_labels *labels, +const struct ovs_key_ct_labels *mask) +{ + struct nf_conn_labels *cl; + int err; + + cl = ovs_ct_get_conn_labels(ct); + if (!cl) + return -ENOSPC; + + err = nf_connlabels_replace(ct, labels->ct_labels_32, + mask->ct_labels_32, + OVS_CT_LABELS_LEN_32); + if (err) + return err; + + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); - ovs_ct_get_labels(ct, >ct.labels); return 0; } @@ -879,25 +888,36 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, const struct ovs_conntrack_info *info, struct sk_buff *skb) { +
[PATCH v3 net-next 03/10] openvswitch: Do not trigger events for unconfirmed connections.
Receiving change events before the 'new' event for the connection has been received can be confusing. Avoid triggering change events for setting conntrack mark or labels before the conntrack entry has been confirmed. Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark") Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 4df9a54..a6ff374 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -245,7 +245,8 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, new_mark = ct_mark | (ct->mark & ~(mask)); if (ct->mark != new_mark) { ct->mark = new_mark; - nf_conntrack_event_cache(IPCT_MARK, ct); + if (nf_ct_is_confirmed(ct)) + nf_conntrack_event_cache(IPCT_MARK, ct); key->ct.mark = new_mark; } @@ -262,7 +263,6 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, enum ip_conntrack_info ctinfo; struct nf_conn_labels *cl; struct nf_conn *ct; - int err; /* The connection could be invalid, in which case set_label is no-op.*/ ct = nf_ct_get(skb, ); @@ -277,10 +277,26 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) return -ENOSPC; - err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask, - OVS_CT_LABELS_LEN / sizeof(u32)); - if (err) - return err; + if (nf_ct_is_confirmed(ct)) { + /* Triggers a change event, which makes sense only for +* confirmed connections. +*/ + int err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask, + OVS_CT_LABELS_LEN / sizeof(u32)); + if (err) + return err; + } else { + u32 *dst = (u32 *)cl->bits; + const u32 *msk = (const u32 *)mask->ct_labels; + const u32 *lbl = (const u32 *)labels->ct_labels; + int i; + + /* No-one else has access to the non-confirmed entry, copy +* labels over, keeping any bits we are not explicitly setting. +*/ + for (i = 0; i < OVS_CT_LABELS_LEN / sizeof(u32); i++) + dst[i] = (dst[i] & ~msk[i]) | (lbl[i] & msk[i]); + } ovs_ct_get_labels(ct, >ct.labels); return 0; -- 2.1.4
[PATCH v3 net-next 01/10] openvswitch: Fix comments for skb->_nfct
Fix comments referring to skb 'nfct' and 'nfctinfo' fields now that they are combined into '_nfct'. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index fbffe0e..5de6d12 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -157,7 +157,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, ovs_ct_get_labels(ct, >ct.labels); } -/* Update 'key' based on skb->nfct. If 'post_ct' is true, then OVS has +/* Update 'key' based on skb->_nfct. If 'post_ct' is true, then OVS has * previously sent the packet to conntrack via the ct action. If * 'keep_nat_flags' is true, the existing NAT flags retained, else they are * initialized from the connection status. @@ -421,12 +421,12 @@ ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h) /* Find an existing connection which this packet belongs to without * re-attributing statistics or modifying the connection state. This allows an - * skb->nfct lost due to an upcall to be recovered during actions execution. + * skb->_nfct lost due to an upcall to be recovered during actions execution. * * Must be called with rcu_read_lock. * - * On success, populates skb->nfct and skb->nfctinfo, and returns the - * connection. Returns NULL if there is no existing entry. + * On success, populates skb->_nfct and returns the connection. Returns NULL + * if there is no existing entry. */ static struct nf_conn * ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, @@ -464,7 +464,7 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, return ct; } -/* Determine whether skb->nfct is equal to the result of conntrack lookup. */ +/* Determine whether skb->_nfct is equal to the result of conntrack lookup. */ static bool skb_nfct_cached(struct net *net, const struct sw_flow_key *key, const struct ovs_conntrack_info *info, @@ -475,7 +475,7 @@ static bool skb_nfct_cached(struct net *net, ct = nf_ct_get(skb, ); /* If no ct, check if we have evidence that an existing conntrack entry -* might be found for this skb. This happens when we lose a skb->nfct +* might be found for this skb. This happens when we lose a skb->_nfct * due to an upcall. If the connection was not confirmed, it is not * cached and needs to be run through conntrack again. */ @@ -699,7 +699,7 @@ static int ovs_ct_nat(struct net *net, struct sw_flow_key *key, /* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if * not done already. Update key with new CT state after passing the packet * through conntrack. - * Note that if the packet is deemed invalid by conntrack, skb->nfct will be + * Note that if the packet is deemed invalid by conntrack, skb->_nfct will be * set to NULL and 0 will be returned. */ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, -- 2.1.4
[PATCH v3 net-next 09/10] openvswitch: Add force commit.
Stateful network admission policy may allow connections to one direction and reject connections initiated in the other direction. After policy change it is possible that for a new connection an overlapping conntrack entry already exists, where the original direction of the existing connection is opposed to the new connection's initial packet. Most importantly, conntrack state relating to the current packet gets the "reply" designation based on whether the original direction tuple or the reply direction tuple matched. If this "directionality" is wrong w.r.t. to the stateful network admission policy it may happen that packets in neither direction are correctly admitted. This patch adds a new "force commit" option to the OVS conntrack action that checks the original direction of an existing conntrack entry. If that direction is opposed to the current packet, the existing conntrack entry is deleted and a new one is subsequently created in the correct direction. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 5 + net/openvswitch/conntrack.c | 26 -- 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 90af8b8..7f41f7d 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -674,6 +674,10 @@ struct ovs_action_hash { * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG. * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address * translation (NAT) on the packet. + * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of doing + * nothing if the connection is already committed will check that the current + * packet is in conntrack entry's original direction. If directionality does + * not match, will delete the existing conntrack entry and commit a new one. */ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -684,6 +688,7 @@ enum ovs_ct_attr { OVS_CT_ATTR_HELPER, /* netlink helper to assist detection of related connections. */ OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ + OVS_CT_ATTR_FORCE_COMMIT, /* No argument */ __OVS_CT_ATTR_MAX }; diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 8685bcd..de47782 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -65,6 +65,7 @@ struct ovs_conntrack_info { struct nf_conn *ct; u8 commit : 1; u8 nat : 3; /* enum ovs_ct_nat */ + u8 force : 1; u16 family; struct md_mark mark; struct md_labels labels; @@ -615,10 +616,13 @@ static bool skb_nfct_cached(struct net *net, */ if (!ct && key->ct.state & OVS_CS_F_TRACKED && !(key->ct.state & OVS_CS_F_INVALID) && - key->ct.zone == info->zone.id) + key->ct.zone == info->zone.id) { ct = ovs_ct_find_existing(net, >zone, info->family, skb, !!(key->ct.state & OVS_CS_F_NAT_MASK)); + if (ct) + nf_ct_get(skb, ); + } if (!ct) return false; if (!net_eq(net, read_pnet(>ct_net))) @@ -632,6 +636,18 @@ static bool skb_nfct_cached(struct net *net, if (help && rcu_access_pointer(help->helper) != info->helper) return false; } + /* Force conntrack entry direction to the current packet? */ + if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) { + /* Delete the conntrack entry if confirmed, else just release +* the reference. +*/ + if (nf_ct_is_confirmed(ct)) + nf_ct_delete(ct, 0, 0); + else + nf_conntrack_put(>ct_general); + nf_ct_set(skb, NULL, 0); + return false; + } return true; } @@ -1209,6 +1225,7 @@ static int parse_nat(const struct nlattr *attr, static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = { [OVS_CT_ATTR_COMMIT]= { .minlen = 0, .maxlen = 0 }, + [OVS_CT_ATTR_FORCE_COMMIT] = { .minlen = 0, .maxlen = 0 }, [OVS_CT_ATTR_ZONE] = { .minlen = sizeof(u16), .maxlen = sizeof(u16) }, [OVS_CT_ATTR_MARK] = { .minlen = sizeof(struct md_mark), @@ -1248,6 +1265,9 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, } switch (type) { + case OVS_CT_ATTR_FORCE_COMMIT: + info->force = true; + /* fall throug
[PATCH v3 net-next 02/10] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
The conntrack lookup for existing connections fails to invert the packet 5-tuple for NATted packets, and therefore fails to find the existing conntrack entry. Conntrack only stores 5-tuples for incoming packets, and there are various situations where a lookup on a packet that has already been transformed by NAT needs to be made. Looking up an existing conntrack entry upon executing packet received from the userspace is one of them. This patch fixes ovs_ct_find_existing() to invert the packet 5-tuple for the conntrack lookup whenever the packet has already been transformed by conntrack from its input form as evidenced by one of the NAT flags being set in the conntrack state metadata. Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 5de6d12..4df9a54 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -430,7 +430,7 @@ ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h) */ static struct nf_conn * ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, -u8 l3num, struct sk_buff *skb) +u8 l3num, struct sk_buff *skb, bool natted) { struct nf_conntrack_l3proto *l3proto; struct nf_conntrack_l4proto *l4proto; @@ -453,6 +453,17 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, return NULL; } + /* Must invert the tuple if skb has been transformed by NAT. */ + if (natted) { + struct nf_conntrack_tuple inverse; + + if (!nf_ct_invert_tuple(, , l3proto, l4proto)) { + pr_debug("ovs_ct_find_existing: Inversion failed!\n"); + return NULL; + } + tuple = inverse; + } + /* look for tuple match */ h = nf_conntrack_find_get(net, zone, ); if (!h) @@ -460,6 +471,13 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, ct = nf_ct_tuplehash_to_ctrack(h); + /* Inverted packet tuple matches the reverse direction conntrack tuple, +* select the other tuplehash to get the right 'ctinfo' bits for this +* packet. +*/ + if (natted) + h = >tuplehash[!h->tuple.dst.dir]; + nf_ct_set(skb, ct, ovs_ct_get_info(h)); return ct; } @@ -482,7 +500,9 @@ static bool skb_nfct_cached(struct net *net, if (!ct && key->ct.state & OVS_CS_F_TRACKED && !(key->ct.state & OVS_CS_F_INVALID) && key->ct.zone == info->zone.id) - ct = ovs_ct_find_existing(net, >zone, info->family, skb); + ct = ovs_ct_find_existing(net, >zone, info->family, skb, + !!(key->ct.state +& OVS_CS_F_NAT_MASK)); if (!ct) return false; if (!net_eq(net, read_pnet(>ct_net))) -- 2.1.4
[PATCH v3 net-next 08/10] openvswitch: Add original direction conntrack tuple to sw_flow_key.
Add the fields of the conntrack original direction 5-tuple to struct sw_flow_key. The new fields are initially marked as non-existent, and are populated whenever a conntrack action is executed and either finds or generates a conntrack entry. This means that these fields exist for all packets that were not rejected by conntrack as untrackable. The original tuple fields in the sw_flow_key are filled from the original direction tuple of the conntrack entry relating to the current packet, or from the original direction tuple of the master conntrack entry, if the current conntrack entry has a master. Generally, expected connections of connections having an assigned helper (e.g., FTP), have a master conntrack entry. The main purpose of the new conntrack original tuple fields is to allow matching on them for policy decision purposes, with the premise that the admissibility of tracked connections reply packets (as well as original direction packets), and both direction packets of any related connections may be based on ACL rules applying to the master connection's original direction 5-tuple. This also makes it easier to make policy decisions when the actual packet headers might have been transformed by NAT, as the original direction 5-tuple represents the packet headers before any such transformation. When using the original direction 5-tuple the admissibility of return and/or related packets need not be based on the mere existence of a conntrack entry, allowing separation of admission policy from the established conntrack state. While existence of a conntrack entry is required for admission of the return or related packets, policy changes can render connections that were initially admitted to be rejected or dropped afterwards. If the admission of the return and related packets was based on mere conntrack state (e.g., connection being in an established state), a policy change that would make the connection rejected or dropped would need to find and delete all conntrack entries affected by such a change. When using the original direction 5-tuple matching the affected conntrack entries can be allowed to time out instead, as the established state of the connection would not need to be the basis for packet admission any more. It should be noted that the directionality of related connections may be the same or different than that of the master connection, and neither the original direction 5-tuple nor the conntrack state bits carry this information. If needed, the directionality of the master connection can be stored in master's conntrack mark or labels, which are automatically inherited by the expected related connections. The fact that neither ARP not ND packets are trackable by conntrack allows mutual exclusion between ARP/ND and the new conntrack original tuple fields. Hence, the IP addresses are overlaid in union with ARP and ND fields. This allows the sw_flow_key to not grow much due to this patch, but it also means that we must be careful to never use the new key fields with ARP or ND packets. ARP is easy to distinguish and keep mutually exclusive based on the ethernet type, but ND being an ICMPv6 protocol requires a bit more attention. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 20 +- net/openvswitch/actions.c| 2 + net/openvswitch/conntrack.c | 86 +--- net/openvswitch/conntrack.h | 10 - net/openvswitch/flow.c | 34 +--- net/openvswitch/flow.h | 49 ++- net/openvswitch/flow_netlink.c | 85 +-- net/openvswitch/flow_netlink.h | 7 +++- 8 files changed, 246 insertions(+), 47 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 96aee34..90af8b8 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -1,6 +1,6 @@ /* - * Copyright (c) 2007-2013 Nicira, Inc. + * Copyright (c) 2007-2017 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -331,6 +331,8 @@ enum ovs_key_attr { OVS_KEY_ATTR_CT_ZONE, /* u16 connection tracking zone. */ OVS_KEY_ATTR_CT_MARK, /* u32 connection tracking mark */ OVS_KEY_ATTR_CT_LABELS, /* 16-octet connection tracking label */ + OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, /* struct ovs_key_ct_tuple_ipv4 */ + OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, /* struct ovs_key_ct_tuple_ipv6 */ #ifdef __KERNEL__ OVS_KEY_ATTR_TUNNEL_INFO, /* struct ip_tunnel_info */ @@ -472,6 +474,22 @@ struct ovs_key_ct_labels { #define OVS_CS_F_NAT_MASK (OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT) +struct ovs_key_ct_tuple_ipv4 { + __be32 ipv4_src; + __be32 ipv4_dst; + __be16 src_port; + __be16 dst_port; + __u8 ipv4_proto; +}; + +
[PATCH v2 net-next 3/9] openvswitch: Simplify labels length logic.
Since 23014011ba42 ("netfilter: conntrack: support a fixed size of 128 distinct labels"), the size of conntrack labels extension has fixed to 128 bits, so we do not need to check for labels sizes shorter than 128 at run-time. This patch simplifies labels length logic accordingly, but allows the conntrack labels size to be increased in the future without breaking the build. In the event of conntrack labels increasing in size OVS would still be able to deal with the 128 first label bits. Suggested-by: Joe Stringer <j...@ovn.org> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 6730f09..a07e5cd 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -129,22 +129,22 @@ static u32 ovs_ct_get_mark(const struct nf_conn *ct) #endif } +/* Guard against conntrack labels max size shrinking below 128 bits. */ +#if NF_CT_LABELS_MAX_SIZE < 16 +#error NF_CT_LABELS_MAX_SIZE must be at least 16 bytes +#endif + static void ovs_ct_get_labels(const struct nf_conn *ct, struct ovs_key_ct_labels *labels) { struct nf_conn_labels *cl = ct ? nf_ct_labels_find(ct) : NULL; - if (cl) { - size_t len = sizeof(cl->bits); - - if (len > OVS_CT_LABELS_LEN) - len = OVS_CT_LABELS_LEN; - else if (len < OVS_CT_LABELS_LEN) - memset(labels, 0, OVS_CT_LABELS_LEN); - memcpy(labels, cl->bits, len); - } else { + if (cl) + memcpy(labels, cl->bits, + sizeof(cl->bits) > OVS_CT_LABELS_LEN + ? OVS_CT_LABELS_LEN : sizeof(cl->bits)); + else memset(labels, 0, OVS_CT_LABELS_LEN); - } } static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, @@ -274,7 +274,7 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, nf_ct_labels_ext_add(ct); cl = nf_ct_labels_find(ct); } - if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) + if (!cl) return -ENOSPC; err = nf_connlabels_replace(ct, labels->ct_labels_32, -- 2.1.4
[PATCH v2 net-next 4/9] openvswitch: Do not trigger events for unconfirmed connections.
Receiving change events before the 'new' event for the connection has been received can be confusing. Avoid triggering change events for setting conntrack mark or labels before the conntrack entry has been confirmed. Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark") Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 30 +++--- 1 file changed, 23 insertions(+), 7 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index a07e5cd..6e3e5e7 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -245,7 +245,8 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, new_mark = ct_mark | (ct->mark & ~(mask)); if (ct->mark != new_mark) { ct->mark = new_mark; - nf_conntrack_event_cache(IPCT_MARK, ct); + if (nf_ct_is_confirmed(ct)) + nf_conntrack_event_cache(IPCT_MARK, ct); key->ct.mark = new_mark; } @@ -262,7 +263,6 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, enum ip_conntrack_info ctinfo; struct nf_conn_labels *cl; struct nf_conn *ct; - int err; /* The connection could be invalid, in which case set_label is no-op.*/ ct = nf_ct_get(skb, ); @@ -277,11 +277,27 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, if (!cl) return -ENOSPC; - err = nf_connlabels_replace(ct, labels->ct_labels_32, - mask->ct_labels_32, - OVS_CT_LABELS_LEN_32); - if (err) - return err; + if (nf_ct_is_confirmed(ct)) { + /* Triggers a change event, which makes sense only for +* confirmed connections. +*/ + int err = nf_connlabels_replace(ct, labels->ct_labels_32, + mask->ct_labels_32, + OVS_CT_LABELS_LEN_32); + if (err) + return err; + } else { + u32 *dst = (u32 *)cl->bits; + int i; + + /* No-one else has access to the non-confirmed entry, copy +* labels over, keeping any bits we are not explicitly setting. +*/ + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] & +mask->ct_labels_32[i]); + } ovs_ct_get_labels(ct, >ct.labels); return 0; -- 2.1.4
[PATCH v2 net-next 0/9] openvswitch: Conntrack integration improvements.
This series improves the conntrack integration code in the openvswitch module by fixing bugs (patches 1, 4, and 6), clarifying code (patches 2, 3, and 5), improving performance (patch 9), and adding new features enabling better translation from firewall admission policy to network configuration requested by user communities (patches 7 and 8). Jarno Rajahalme (9): openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted. openvswitch: Unionize ovs_key_ct_label with a u32 array. openvswitch: Simplify labels length logic. openvswitch: Do not trigger events for unconfirmed connections. openvswitch: Refactor labels initialization. openvswitch: Inherit master's labels. openvswitch: Add original direction conntrack tuple to sw_flow_key. openvswitch: Add force commit. openvswitch: Pack struct sw_flow_key. include/uapi/linux/openvswitch.h | 33 - net/openvswitch/actions.c| 2 + net/openvswitch/conntrack.c | 285 +++ net/openvswitch/conntrack.h | 14 +- net/openvswitch/flow.c | 34 - net/openvswitch/flow.h | 55 ++-- net/openvswitch/flow_netlink.c | 92 ++--- net/openvswitch/flow_netlink.h | 7 +- 8 files changed, 416 insertions(+), 106 deletions(-) -- 2.1.4
[PATCH v2 net-next 1/9] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
The conntrack lookup for existing connections fails to invert the packet 5-tuple for NATted packets, and therefore fails to find the existing conntrack entry. Conntrack only stores 5-tuples for incoming packets, and there are various situations where a lookup on a packet that has already been transformed by NAT needs to be made. Looking up an existing conntrack entry upon executing packet received from the userspace is one of them. This patch fixes ovs_ct_find_existing() to invert the packet 5-tuple for the conntrack lookup whenever the packet has already been transformed by conntrack from its input form as evidenced by one of the NAT flags being set in the conntrack state metadata. Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 54253ea..b91baa2 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -430,7 +430,7 @@ ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h) */ static struct nf_conn * ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, -u8 l3num, struct sk_buff *skb) +u8 l3num, struct sk_buff *skb, bool natted) { struct nf_conntrack_l3proto *l3proto; struct nf_conntrack_l4proto *l4proto; @@ -453,6 +453,17 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, return NULL; } + /* Must invert the tuple if skb has been transformed by NAT. */ + if (natted) { + struct nf_conntrack_tuple inverse; + + if (!nf_ct_invert_tuple(, , l3proto, l4proto)) { + pr_debug("ovs_ct_find_existing: Inversion failed!\n"); + return NULL; + } + tuple = inverse; + } + /* look for tuple match */ h = nf_conntrack_find_get(net, zone, ); if (!h) @@ -460,6 +471,13 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, ct = nf_ct_tuplehash_to_ctrack(h); + /* Inverted packet tuple matches the reverse direction conntrack tuple, +* select the other tuplehash to get the right 'ctinfo' bits for this +* packet. +*/ + if (natted) + h = >tuplehash[!h->tuple.dst.dir]; + skb->nfct = >ct_general; skb->nfctinfo = ovs_ct_get_info(h); return ct; @@ -483,7 +501,9 @@ static bool skb_nfct_cached(struct net *net, if (!ct && key->ct.state & OVS_CS_F_TRACKED && !(key->ct.state & OVS_CS_F_INVALID) && key->ct.zone == info->zone.id) - ct = ovs_ct_find_existing(net, >zone, info->family, skb); + ct = ovs_ct_find_existing(net, >zone, info->family, skb, + !!(key->ct.state +& OVS_CS_F_NAT_MASK)); if (!ct) return false; if (!net_eq(net, read_pnet(>ct_net))) -- 2.1.4
[PATCH v2 net-next 2/9] openvswitch: Unionize ovs_key_ct_label with a u32 array.
Make the array of labels in struct ovs_key_ct_label an union, adding a u32 array of the same byte size as the existing u8 array. It is faster to loop through the labels 32 bits at the time, which is also the alignment of netlink attributes. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 8 ++-- net/openvswitch/conntrack.c | 9 + 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 375d812..96aee34 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -446,9 +446,13 @@ struct ovs_key_nd { __u8nd_tll[ETH_ALEN]; }; -#define OVS_CT_LABELS_LEN 16 +#define OVS_CT_LABELS_LEN_32 4 +#define OVS_CT_LABELS_LEN (OVS_CT_LABELS_LEN_32 * sizeof(__u32)) struct ovs_key_ct_labels { - __u8ct_labels[OVS_CT_LABELS_LEN]; + union { + __u8ct_labels[OVS_CT_LABELS_LEN]; + __u32 ct_labels_32[OVS_CT_LABELS_LEN_32]; + }; }; /* OVS_KEY_ATTR_CT_STATE flags */ diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index b91baa2..6730f09 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -277,8 +277,9 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) return -ENOSPC; - err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask, - OVS_CT_LABELS_LEN / sizeof(u32)); + err = nf_connlabels_replace(ct, labels->ct_labels_32, + mask->ct_labels_32, + OVS_CT_LABELS_LEN_32); if (err) return err; @@ -852,8 +853,8 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels) { size_t i; - for (i = 0; i < sizeof(*labels); i++) - if (labels->ct_labels[i]) + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + if (labels->ct_labels_32[i]) return true; return false; -- 2.1.4
[PATCH v2 net-next 7/9] openvswitch: Add original direction conntrack tuple to sw_flow_key.
Add the fields of the conntrack original direction 5-tuple to struct sw_flow_key. The new fields are initially marked as non-existent, and are populated whenever a conntrack action is executed and either finds or generates a conntrack entry. This means that these fields exist for all packets that were not rejected by conntrack as untrackable. The original tuple fields in the sw_flow_key are filled from the original direction tuple of the conntrack entry relating to the current packet, or from the original direction tuple of the master conntrack entry, if the current conntrack entry has a master. Generally, expected connections of connections having an assigned helper (e.g., FTP), have a master conntrack entry. The main purpose of the new conntrack original tuple fields is to allow matching on them for policy decision purposes, with the premise that the admissibility of tracked connections reply packets (as well as original direction packets), and both direction packets of any related connections may be based on ACL rules applying to the master connection's original direction 5-tuple. This also makes it easier to make policy decisions when the actual packet headers might have been transformed by NAT, as the original direction 5-tuple represents the packet headers before any such transformation. When using the original direction 5-tuple the admissibility of return and/or related packets need not be based on the mere existence of a conntrack entry, allowing separation of admission policy from the established conntrack state. While existence of a conntrack entry is required for admission of the return or related packets, policy changes can render connections that were initially admitted to be rejected or dropped afterwards. If the admission of the return and related packets was based on mere conntrack state (e.g., connection being in an established state), a policy change that would make the connection rejected or dropped would need to find and delete all conntrack entries affected by such a change. When using the original direction 5-tuple matching the affected conntrack entries can be allowed to time out instead, as the established state of the connection would not need to be the basis for packet admission any more. It should be noted that the directionality of related connections may be the same or different than that of the master connection, and neither the original direction 5-tuple nor the conntrack state bits carry this information. If needed, the directionality of the master connection can be stored in master's conntrack mark or labels, which are automatically inherited by the expected related connections. The fact that neither ARP not ND packets are trackable by conntrack allows mutual exclusion between ARP/ND and the new conntrack original tuple fields. Hence, the IP addresses are overlaid in union with ARP and ND fields. This allows the sw_flow_key to not grow much due to this patch, but it also means that we must be careful to never use the new key fields with ARP or ND packets. ARP is easy to distinguish and keep mutually exclusive based on the ethernet type, but ND being an ICMPv6 protocol requires a bit more attention. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 20 +- net/openvswitch/actions.c| 2 + net/openvswitch/conntrack.c | 86 +--- net/openvswitch/conntrack.h | 10 - net/openvswitch/flow.c | 34 +--- net/openvswitch/flow.h | 49 ++- net/openvswitch/flow_netlink.c | 85 +-- net/openvswitch/flow_netlink.h | 7 +++- 8 files changed, 246 insertions(+), 47 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 96aee34..90af8b8 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -1,6 +1,6 @@ /* - * Copyright (c) 2007-2013 Nicira, Inc. + * Copyright (c) 2007-2017 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -331,6 +331,8 @@ enum ovs_key_attr { OVS_KEY_ATTR_CT_ZONE, /* u16 connection tracking zone. */ OVS_KEY_ATTR_CT_MARK, /* u32 connection tracking mark */ OVS_KEY_ATTR_CT_LABELS, /* 16-octet connection tracking label */ + OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, /* struct ovs_key_ct_tuple_ipv4 */ + OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, /* struct ovs_key_ct_tuple_ipv6 */ #ifdef __KERNEL__ OVS_KEY_ATTR_TUNNEL_INFO, /* struct ip_tunnel_info */ @@ -472,6 +474,22 @@ struct ovs_key_ct_labels { #define OVS_CS_F_NAT_MASK (OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT) +struct ovs_key_ct_tuple_ipv4 { + __be32 ipv4_src; + __be32 ipv4_dst; + __be16 src_port; + __be16 dst_port; + __u8 ipv4_proto; +}; + +
[PATCH v2 net-next 9/9] openvswitch: Pack struct sw_flow_key.
struct sw_flow_key has two 16-bit holes. Move the most matched conntrack match fields there. In some typical cases this reduces the size of the key that needs to be hashed into half and into one cache line. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c| 40 net/openvswitch/conntrack.h| 8 net/openvswitch/flow.h | 14 -- net/openvswitch/flow_netlink.c | 11 +++ 4 files changed, 39 insertions(+), 34 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index f15c855..6ae18bd 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -154,7 +154,7 @@ static void __ovs_ct_update_key_orig_tp(struct sw_flow_key *key, const struct nf_conntrack_tuple *orig, u8 icmp_proto) { - key->ct.orig_proto = orig->dst.protonum; + key->ct_orig_proto = orig->dst.protonum; if (orig->dst.protonum == icmp_proto) { key->ct.orig_tp.src = htons(orig->dst.u.icmp.type); key->ct.orig_tp.dst = htons(orig->dst.u.icmp.code); @@ -168,8 +168,8 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, const struct nf_conntrack_zone *zone, const struct nf_conn *ct) { - key->ct.state = state; - key->ct.zone = zone->id; + key->ct_state = state; + key->ct_zone = zone->id; key->ct.mark = ovs_ct_get_mark(ct); ovs_ct_get_labels(ct, >ct.labels); @@ -197,10 +197,10 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, return; } } - /* Clear 'ct.orig_proto' to mark the non-existence of conntrack + /* Clear 'ct_orig_proto' to mark the non-existence of conntrack * original direction key fields. */ - key->ct.orig_proto = 0; + key->ct_orig_proto = 0; } /* Update 'key' based on skb->nfct. If 'post_ct' is true, then OVS has @@ -230,7 +230,7 @@ static void ovs_ct_update_key(const struct sk_buff *skb, if (ct->master) state |= OVS_CS_F_RELATED; if (keep_nat_flags) { - state |= key->ct.state & OVS_CS_F_NAT_MASK; + state |= key->ct_state & OVS_CS_F_NAT_MASK; } else { if (ct->status & IPS_SRC_NAT) state |= OVS_CS_F_SRC_NAT; @@ -261,11 +261,11 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key) int ovs_ct_put_key(const struct sw_flow_key *swkey, const struct sw_flow_key *output, struct sk_buff *skb) { - if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct.state)) + if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct_state)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) && - nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct.zone)) + nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct_zone)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) && @@ -277,14 +277,14 @@ int ovs_ct_put_key(const struct sw_flow_key *swkey, >ct.labels)) return -EMSGSIZE; - if (swkey->ct.orig_proto) { + if (swkey->ct_orig_proto) { if (swkey->eth.type == htons(ETH_P_IP)) { struct ovs_key_ct_tuple_ipv4 orig = { output->ipv4.ct_orig.src, output->ipv4.ct_orig.dst, output->ct.orig_tp.src, output->ct.orig_tp.dst, - output->ct.orig_proto, + output->ct_orig_proto, }; if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, sizeof(orig), )) @@ -295,7 +295,7 @@ int ovs_ct_put_key(const struct sw_flow_key *swkey, IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.dst), output->ct.orig_tp.src, output->ct.orig_tp.dst, - output->ct.orig_proto, + output->ct_orig_proto, }; if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, sizeof(orig), )) @@ -615,11 +615,11 @@ static bool skb_nfct_cached(struct net *net, * due to an upcall. If the connection was not confirmed, it is not * cached and needs to be run thro
[PATCH v2 net-next 5/9] openvswitch: Refactor labels initialization.
Refactoring conntrack labels initialization makes chenges in later patches easier to review. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 104 ++-- 1 file changed, 62 insertions(+), 42 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 6e3e5e7..026f4a9 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -229,19 +229,12 @@ int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb) return 0; } -static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, +static int ovs_ct_set_mark(struct nf_conn *ct, struct sw_flow_key *key, u32 ct_mark, u32 mask) { #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) - enum ip_conntrack_info ctinfo; - struct nf_conn *ct; u32 new_mark; - /* The connection could be invalid, in which case set_mark is no-op. */ - ct = nf_ct_get(skb, ); - if (!ct) - return 0; - new_mark = ct_mark | (ct->mark & ~(mask)); if (ct->mark != new_mark) { ct->mark = new_mark; @@ -256,50 +249,66 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, #endif } -static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, -const struct ovs_key_ct_labels *labels, -const struct ovs_key_ct_labels *mask) +static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct) { - enum ip_conntrack_info ctinfo; struct nf_conn_labels *cl; - struct nf_conn *ct; - - /* The connection could be invalid, in which case set_label is no-op.*/ - ct = nf_ct_get(skb, ); - if (!ct) - return 0; cl = nf_ct_labels_find(ct); if (!cl) { nf_ct_labels_ext_add(ct); cl = nf_ct_labels_find(ct); } + + return cl; +} + +/* Initialize labels for a new, to be committed conntrack entry. Note that + * since the new connection is not yet confirmed, and thus no-one else has + * access to it's labels, we simply write them over. Also, we refrain from + * triggering events, as receiving change events before the create event would + * be confusing. + */ +static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, + const struct ovs_key_ct_labels *labels, + const struct ovs_key_ct_labels *mask) +{ + struct nf_conn_labels *cl; + u32 *dst; + int i; + + cl = ovs_ct_get_conn_labels(ct); if (!cl) return -ENOSPC; - if (nf_ct_is_confirmed(ct)) { - /* Triggers a change event, which makes sense only for -* confirmed connections. -*/ - int err = nf_connlabels_replace(ct, labels->ct_labels_32, - mask->ct_labels_32, - OVS_CT_LABELS_LEN_32); - if (err) - return err; - } else { - u32 *dst = (u32 *)cl->bits; - int i; + dst = (u32 *)cl->bits; + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] & mask->ct_labels_32[i]); - /* No-one else has access to the non-confirmed entry, copy -* labels over, keeping any bits we are not explicitly setting. -*/ - for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) - dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | - (labels->ct_labels_32[i] & -mask->ct_labels_32[i]); - } + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); + + return 0; +} + +static int ovs_ct_set_labels(struct nf_conn *ct, struct sw_flow_key *key, +const struct ovs_key_ct_labels *labels, +const struct ovs_key_ct_labels *mask) +{ + struct nf_conn_labels *cl; + int err; + + cl = ovs_ct_get_conn_labels(ct); + if (!cl) + return -ENOSPC; + + err = nf_connlabels_replace(ct, labels->ct_labels_32, + mask->ct_labels_32, + OVS_CT_LABELS_LEN_32); + if (err) + return err; + + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); - ovs_ct_get_labels(ct, >ct.labels); return 0; } @@ -881,25 +890,36 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, const struct ovs_conntrack_info *info, struct sk_buff *skb) { +
[PATCH v2 net-next 6/9] openvswitch: Inherit master's labels.
We avoid calling into nf_conntrack_in() for expected connections, as that would remove the expectation that we want to stick around until we are ready to commit the connection. Instead, we do a lookup in the expectation table directly. However, after a successful expectation lookup we have set the flow key label field from the master connection, whereas nf_conntrack_in() does not do this. This leads to master's labels being inherited after an expectation lookup, but those labels not being inherited after the corresponding conntrack action with a commit flag. This patch resolves the problem by changing the commit code path to also inherit the master's labels to the expected connection. Resolving this conflict in favor or inheriting the labels allows more information be passed from the master connection to related connections, which would otherwise be much harder if the 32 bits in the connmark are not enough. Labels can still be set explicitly, so this change only affects the default values of the labels in presense of a master connection. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 45 +++-- 1 file changed, 31 insertions(+), 14 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 026f4a9..5b408f9 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -73,6 +73,8 @@ struct ovs_conntrack_info { #endif }; +static bool labels_nonzero(const struct ovs_key_ct_labels *labels); + static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info); static u16 key_to_nfproto(const struct sw_flow_key *key) @@ -272,18 +274,32 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, const struct ovs_key_ct_labels *labels, const struct ovs_key_ct_labels *mask) { - struct nf_conn_labels *cl; - u32 *dst; - int i; + struct nf_conn_labels *cl, *master_cl; + bool have_mask = labels_nonzero(mask); + + /* Inherit master's labels to the related connection? */ + master_cl = (ct->master) ? nf_ct_labels_find(ct->master) : NULL; + + if (!master_cl && !have_mask) + return 0; /* Nothing to do. */ cl = ovs_ct_get_conn_labels(ct); if (!cl) return -ENOSPC; - dst = (u32 *)cl->bits; - for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) - dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | - (labels->ct_labels_32[i] & mask->ct_labels_32[i]); + /* Inherit the master's labels, if any. */ + if (master_cl) + *cl = *master_cl; + + if (have_mask) { + u32 *dst = (u32 *)cl->bits; + int i; + + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] +& mask->ct_labels_32[i]); + } memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); @@ -913,13 +929,14 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, if (err) return err; } - if (labels_nonzero(>labels.mask)) { - if (!nf_ct_is_confirmed(ct)) - err = ovs_ct_init_labels(ct, key, >labels.value, ->labels.mask); - else - err = ovs_ct_set_labels(ct, key, >labels.value, - >labels.mask); + if (!nf_ct_is_confirmed(ct)) { + err = ovs_ct_init_labels(ct, key, >labels.value, +>labels.mask); + if (err) + return err; + } else if (labels_nonzero(>labels.mask)) { + err = ovs_ct_set_labels(ct, key, >labels.value, + >labels.mask); if (err) return err; } -- 2.1.4
[PATCH v2 net-next 8/9] openvswitch: Add force commit.
Stateful network admission policy may allow connections to one direction and reject connections initiated in the other direction. After policy change it is possible that for a new connection an overlapping conntrack entry already exists, where the original direction of the existing connection is opposed to the new connection's initial packet. Most importantly, conntrack state relating to the current packet gets the "reply" designation based on whether the original direction tuple or the reply direction tuple matched. If this "directionality" is wrong w.r.t. to the stateful network admission policy it may happen that packets in neither direction are correctly admitted. This patch adds a new "force commit" option to the OVS conntrack action that checks the original direction of an existing conntrack entry. If that direction is opposed to the current packet, the existing conntrack entry is deleted and a new one is subsequently created in the correct direction. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 5 + net/openvswitch/conntrack.c | 27 +-- 2 files changed, 30 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 90af8b8..7f41f7d 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -674,6 +674,10 @@ struct ovs_action_hash { * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG. * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address * translation (NAT) on the packet. + * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of doing + * nothing if the connection is already committed will check that the current + * packet is in conntrack entry's original direction. If directionality does + * not match, will delete the existing conntrack entry and commit a new one. */ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -684,6 +688,7 @@ enum ovs_ct_attr { OVS_CT_ATTR_HELPER, /* netlink helper to assist detection of related connections. */ OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ + OVS_CT_ATTR_FORCE_COMMIT, /* No argument */ __OVS_CT_ATTR_MAX }; diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 14c35b7..f15c855 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -65,6 +65,7 @@ struct ovs_conntrack_info { struct nf_conn *ct; u8 commit : 1; u8 nat : 3; /* enum ovs_ct_nat */ + u8 force : 1; u16 family; struct md_mark mark; struct md_labels labels; @@ -616,10 +617,13 @@ static bool skb_nfct_cached(struct net *net, */ if (!ct && key->ct.state & OVS_CS_F_TRACKED && !(key->ct.state & OVS_CS_F_INVALID) && - key->ct.zone == info->zone.id) + key->ct.zone == info->zone.id) { ct = ovs_ct_find_existing(net, >zone, info->family, skb, !!(key->ct.state & OVS_CS_F_NAT_MASK)); + if (ct) + nf_ct_get(skb, ); + } if (!ct) return false; if (!net_eq(net, read_pnet(>ct_net))) @@ -633,6 +637,19 @@ static bool skb_nfct_cached(struct net *net, if (help && rcu_access_pointer(help->helper) != info->helper) return false; } + /* Force conntrack entry direction to the current packet? */ + if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) { + /* Delete the conntrack entry if confirmed, else just release +* the reference. +*/ + if (nf_ct_is_confirmed(ct)) + nf_ct_delete(ct, 0, 0); + else + nf_conntrack_put(>ct_general); + skb->nfct = NULL; + skb->nfctinfo = 0; + return false; + } return true; } @@ -1211,6 +1228,7 @@ static int parse_nat(const struct nlattr *attr, static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = { [OVS_CT_ATTR_COMMIT]= { .minlen = 0, .maxlen = 0 }, + [OVS_CT_ATTR_FORCE_COMMIT] = { .minlen = 0, .maxlen = 0 }, [OVS_CT_ATTR_ZONE] = { .minlen = sizeof(u16), .maxlen = sizeof(u16) }, [OVS_CT_ATTR_MARK] = { .minlen = sizeof(struct md_mark), @@ -1250,6 +1268,9 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, } switch (type) { + case OVS_CT_ATTR_FORCE_COMMIT: + info->for
Re: [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key.
> On Feb 6, 2017, at 11:15 PM, Joe Stringer <j...@ovn.org> wrote: > > On 2 February 2017 at 17:10, Jarno Rajahalme <ja...@ovn.org> wrote: >> struct sw_flow_key has two 16-bit holes. Move the most matched >> conntrack match fields there. In some typical cases this reduces the >> size of the key that needs to be hashed into half and into one cache >> line. >> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > Looks like this misses the zeroing in ovs_nla_get_flow_metadata(); > might want to double-check for any other memset/copies of the key->ct > field. Good catch. Looked, there are no other places to change. Will rebase to current net-next and repost. Jarno
Re: [PATCH net-next 6/7] openvswitch: Add force commit.
> On Feb 7, 2017, at 2:15 PM, Joe Stringer <j...@ovn.org> wrote: > > On 2 February 2017 at 17:10, Jarno Rajahalme <ja...@ovn.org> wrote: >> Stateful network admission policy may allow connections to one >> direction and reject connections initiated in the other direction. >> After policy change it is possible that for a new connection an >> overlapping conntrack entry already exist, where the connection >> original direction is opposed to the new connection's initial packet. >> >> Most importantly, conntrack state relating to the current packet gets >> the "reply" designation based on whether the original direction tuple >> or the reply direction tuple matched. If this "directionality" is >> wrong w.r.t. to the stateful network admission policy it may happen >> that packets in neither direction are correctly admitted. >> >> This patch adds a new "force commit" option to the OVS conntrack >> action that checks the original direction of an existing conntrack >> entry. If that direction is opposed to the current packet, the >> existing conntrack entry is deleted and a new one is subsequently >> created in the correct direction. >> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> >> --- >> include/uapi/linux/openvswitch.h | 10 ++ >> net/openvswitch/conntrack.c | 27 +-- >> 2 files changed, 35 insertions(+), 2 deletions(-) >> >> diff --git a/include/uapi/linux/openvswitch.h >> b/include/uapi/linux/openvswitch.h >> index 90af8b8..d5ba9a9 100644 >> --- a/include/uapi/linux/openvswitch.h >> +++ b/include/uapi/linux/openvswitch.h >> @@ -674,6 +674,10 @@ struct ovs_action_hash { >> * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG. >> * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address >> * translation (NAT) on the packet. >> + * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of doing >> + * nothing if the connection is already committed will check that the >> current >> + * packet is in conntrack entry's original direction. If directionality >> does >> + * not match, will delete the existing conntrack entry and commit a new one. >> */ >> enum ovs_ct_attr { >>OVS_CT_ATTR_UNSPEC, >> @@ -684,6 +688,12 @@ enum ovs_ct_attr { >>OVS_CT_ATTR_HELPER, /* netlink helper to assist detection of >> related connections. */ >>OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ >> + OVS_CT_ATTR_FORCE_COMMIT, /* No argument, commits connection. If >> the >> + * conntrack entry original direction >> tuple >> + * does not match the current packet >> header >> + * values, will delete the current >> conntrack >> + * entry and create a new one. >> + */ > > We only need one copy of the explanation, keep it above the enum, then > the inline comment can be /* No argument */. > OK. >>__OVS_CT_ATTR_MAX >> }; >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c >> index 1afe153..1f27f44 100644 >> --- a/net/openvswitch/conntrack.c >> +++ b/net/openvswitch/conntrack.c >> @@ -65,6 +65,7 @@ struct ovs_conntrack_info { >>struct nf_conn *ct; >>u8 commit : 1; >>u8 nat : 3; /* enum ovs_ct_nat */ >> + u8 force : 1; >>u16 family; >>struct md_mark mark; >>struct md_labels labels; >> @@ -631,10 +632,13 @@ static bool skb_nfct_cached(struct net *net, >> */ >>if (!ct && key->ct.state & OVS_CS_F_TRACKED && >>!(key->ct.state & OVS_CS_F_INVALID) && >> - key->ct.zone == info->zone.id) >> + key->ct.zone == info->zone.id) { >>ct = ovs_ct_find_existing(net, >zone, info->family, skb, >> !!(key->ct.state >> & OVS_CS_F_NAT_MASK)); >> + if (ct) >> + nf_ct_get(skb, ); >> + } > > If ctinfo is only used with the new call below, we can unconditionally > fetch this just before it's used... > >>if (!ct) >>return false; >>
Re: [PATCH net-next 4/7] openvswitch: Inherit master's labels.
> On Feb 6, 2017, at 1:53 PM, Joe Stringer <j...@ovn.org> wrote: > > On 2 February 2017 at 17:10, Jarno Rajahalme <ja...@ovn.org> wrote: >> We avoid calling into nf_conntrack_in() for expected connections, as >> that would remove the expectation that we want to stick around until >> we are ready to commit the connection. Instead, we do a lookup in the >> expectation table directly. However, after a successful expectation >> lookup we have set the flow key label field from the master >> connection, whereas nf_conntrack_in() does not do this. This leads to >> master's labels being iherited after an expectation lookup, but those >> labels not being inherited after the corresponding conntrack action >> with a commit flag. >> >> This patch resolves the problem by changing the commit code path to >> also inherit the master's labels to the expected connection. >> Resolving this conflict in favor or inheriting the labels allows >> information be passed from the master connection to related >> connections, which would otherwise be much harder. Labels can still >> be set explicitly, so this change only affects the default values of >> the labels in presense of a master connection. >> >> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> >> --- >> net/openvswitch/conntrack.c | 48 >> - >> 1 file changed, 34 insertions(+), 14 deletions(-) >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c >> index a163c44..738a4fa 100644 >> --- a/net/openvswitch/conntrack.c >> +++ b/net/openvswitch/conntrack.c >> @@ -265,6 +265,8 @@ static struct nf_conn_labels >> *ovs_ct_get_conn_labels(struct nf_conn *ct) >>return cl; >> } >> >> +static bool labels_nonzero(const struct ovs_key_ct_labels *labels); >> + > > These declarations typically live at the top of the file, not > somewhere in the middle. > Moved. >> /* Initialize labels for a new, to be committed conntrack entry. Note that >> * since the new connection is not yet confirmed, and thus no-one else has >> * access to it's labels, we simply write them over. Also, we refrain from >> @@ -275,18 +277,35 @@ static int ovs_ct_init_labels(struct nf_conn *ct, >> struct sw_flow_key *key, >> const struct ovs_key_ct_labels *labels, >> const struct ovs_key_ct_labels *mask) >> { >> - struct nf_conn_labels *cl; >> - u32 *dst; >> - int i; >> + struct nf_conn_labels *cl, *master_cl; >> + bool have_mask = labels_nonzero(mask); >> + >> + /* Inherit master's labels to the related connection? */ >> + master_cl = (ct->master) ? nf_ct_labels_find(ct->master) : NULL; >> + >> + if (!master_cl && !have_mask) >> + return 0; /* Nothing to do. */ >> >>cl = ovs_ct_get_conn_labels(ct); >>if (!cl) >>return -ENOSPC; >> >> - dst = (u32 *)cl->bits; >> - for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) >> - dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | >> - (labels->ct_labels_32[i] & mask->ct_labels_32[i]); >> + /* Inherit the master's labels, if any. */ >> + if (master_cl) { >> + size_t len = sizeof(master_cl->bits); >> + >> + memcpy(>bits, _cl->bits, >> + len > OVS_CT_LABELS_LEN ? OVS_CT_LABELS_LEN : len); > > Looks like this is another spot where we're trying to handle differing > label lengths, which we could simplify if there was a stronger > guarantee they're the same. Indeed, here the ‘cl’s are different instances of the same structure, so we do need not worry about the sizes at all. I’ll change this to an simple structure assignment for v2. > >> + } >> + if (have_mask) { >> + u32 *dst = (u32 *)cl->bits; >> + int i; >> + >> + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) >> + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | >> + (labels->ct_labels_32[i] >> +& mask->ct_labels_32[i]); >> + } > > By the way, is this open-coding nf_connlabels_replace()? Can > ovs_ct_set_labels() and this share the code? nf_connlabels_r
Re: [PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection.
Thanks for the review! Comments below, Jarno > On Feb 6, 2017, at 1:46 PM, Joe Stringer <j...@ovn.org> wrote: > > On 2 February 2017 at 17:10, Jarno Rajahalme <ja...@ovn.org> wrote: >> Avoid triggering change events for setting conntrack mark or labels >> before the conntrack entry has been confirmed. Refactoring on this >> patch also makes chenges in later patches easier to review. >> >> Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark") >> Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label") >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > Functional and cosmetic changes should be in separate patches. > OK, will split. >> --- >> net/openvswitch/conntrack.c | 87 >> - >> 1 file changed, 63 insertions(+), 24 deletions(-) >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c >> index 6730f09..a163c44 100644 >> --- a/net/openvswitch/conntrack.c >> +++ b/net/openvswitch/conntrack.c >> @@ -229,23 +229,17 @@ int ovs_ct_put_key(const struct sw_flow_key *key, >> struct sk_buff *skb) >>return 0; >> } >> >> -static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, >> +static int ovs_ct_set_mark(struct nf_conn *ct, struct sw_flow_key *key, >> u32 ct_mark, u32 mask) >> { >> #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) >> - enum ip_conntrack_info ctinfo; >> - struct nf_conn *ct; >>u32 new_mark; >> >> - /* The connection could be invalid, in which case set_mark is no-op. >> */ >> - ct = nf_ct_get(skb, ); >> - if (!ct) >> - return 0; >> - >>new_mark = ct_mark | (ct->mark & ~(mask)); >>if (ct->mark != new_mark) { >>ct->mark = new_mark; >> - nf_conntrack_event_cache(IPCT_MARK, ct); >> + if (nf_ct_is_confirmed(ct)) >> + nf_conntrack_event_cache(IPCT_MARK, ct); >>key->ct.mark = new_mark; >>} >> >> @@ -255,26 +249,59 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct >> sw_flow_key *key, >> #endif >> } >> >> -static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, >> -const struct ovs_key_ct_labels *labels, >> -const struct ovs_key_ct_labels *mask) >> +static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct) >> { >> - enum ip_conntrack_info ctinfo; >>struct nf_conn_labels *cl; >> - struct nf_conn *ct; >> - int err; >> - >> - /* The connection could be invalid, in which case set_label is >> no-op.*/ >> - ct = nf_ct_get(skb, ); >> - if (!ct) >> - return 0; >> >>cl = nf_ct_labels_find(ct); >>if (!cl) { >>nf_ct_labels_ext_add(ct); >>cl = nf_ct_labels_find(ct); >>} >> + >>if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) >> + return NULL; >> + >> + return cl; >> +} >> + >> +/* Initialize labels for a new, to be committed conntrack entry. Note that >> + * since the new connection is not yet confirmed, and thus no-one else has >> + * access to it's labels, we simply write them over. Also, we refrain from >> + * triggering events, as receiving change events before the create event >> would >> + * be confusing. >> + */ >> +static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, >> + const struct ovs_key_ct_labels *labels, >> + const struct ovs_key_ct_labels *mask) >> +{ >> + struct nf_conn_labels *cl; >> + u32 *dst; >> + int i; >> + >> + cl = ovs_ct_get_conn_labels(ct); >> + if (!cl) >> + return -ENOSPC; >> + >> + dst = (u32 *)cl->bits; > > Is it worth extending the union to include unsigned long, to avoid > casting it to u32 here? > This cast is on the struct nf_conn_labels, I would not unionize it at this point. This type of cast is typical in conntrack code. >> + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) >> + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | >> + (labels->ct_lab
Re: [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key.
> On Feb 6, 2017, at 11:15 PM, Joe Stringer <j...@ovn.org> wrote: > > On 2 February 2017 at 17:10, Jarno Rajahalme <ja...@ovn.org> wrote: >> Add the fields of the conntrack original direction 5-tuple to struct >> sw_flow_key. The new fields are initially zeroed, and are populated >> whenever a conntrack action is executed and either finds or generates >> a conntrack entry. This means that these fields exist for all packets >> were not rejected by conntrack as untrackable. >> >> The original tuple fields in the sw_flow_key are filled from the >> original direction tuple of the conntrack entry relating to the >> current packet, or from the original direction tuple of the master >> conntrack entry, if the current conntrack entry has a master. >> Generally, expected connections of connections having an assigned >> helper (e.g., FTP), have a master conntrack entry. >> >> The main purpose of the new conntrack original tuple fields is to >> allow matching on them for policy decision purposes, with the premise >> that the admissibility of tracked connections reply packets (as well >> as original direction packets), and both direction packets of any >> related connections may be based on ACL rules applying to the master >> connection's original direction 5-tuple. This also makes it easier to >> make policy decisions when the actual packet headers might have been >> transformed by NAT, as the original direction 5-tuple represents the >> packet headers before any such transformation. >> >> When using the original direction 5-tuple the admissibility of return >> and/or related packets need not be based on the mere existence of a >> conntrack entry, allowing separation of admission policy from the >> established conntrack state. While existence of a conntrack entry is >> required for admission of the return or related packets, policy >> changes can render connections that were initially admitted to be >> rejected or dropped afterwards. If the admission of the return and >> related packets was based on mere conntrack state (e.g., connection >> being in an established state), a policy change that would make the >> connection rejected or dropped would need to find and delete all >> conntrack entries affected by such a change. When using the original >> direction 5-tuple matching the affected conntrack entries can be >> allowed to time out instead, as the established state of the >> connection would not need to be the basis for packet admission any >> more. >> >> It should be noted that the directionality of related connections may >> be the same or different than that of the master connection, and >> neither the original direction 5-tuple nor the conntrack state bits >> carry this information. If needed, the directionality of the master >> connection can be stored in master's conntrack mark or labels, which >> are automatically inherited by the expected related connections. >> >> The fact that neither ARP not ND packets are trackable by conntrack >> allows mutual exclusion between ARP/ND and the new conntrack original >> tuple fields. Hence, the IP addresses are overlaid in union with ARP >> and ND fields. This allows the sw_flow_key to not grow much due to >> this patch, but it also means that we must be careful to never use the >> new key fields with ARP or ND packets. ARP is easy to distinguish and >> keep mutually exclusive based on the ethernet type, but ND being an >> ICMPv6 protocol requires a bit more attention. >> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> >> --- > > OK, maybe we need to do something a bit more to handle the NATed > related connections to address the problem in patch 1. > > > >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c >> index 738a4fa..1afe153 100644 >> --- a/net/openvswitch/conntrack.c >> +++ b/net/openvswitch/conntrack.c >> @@ -155,6 +155,59 @@ static void __ovs_ct_update_key(struct sw_flow_key >> *key, u8 state, >>key->ct.zone = zone->id; >>key->ct.mark = ovs_ct_get_mark(ct); >>ovs_ct_get_labels(ct, >ct.labels); >> + >> + /* Use the master if we have one. */ >> + if (ct && ct->master) >> + ct = ct->master; > > Perhaps: > > if (!ct || sw_flow_key_is_nd(key) || !is_ip_any(key->eth.type)) { >/* zero everything */ >return; > } > > One of the things this helps us to avoid is having a comment in the > middle of an if statement. > > Then afterwards,
Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
> On Feb 6, 2017, at 9:07 AM, Pravin Shelar <pshe...@ovn.org> wrote: > > On Thu, Feb 2, 2017 at 5:10 PM, Jarno Rajahalme <ja...@ovn.org> wrote: >> When looking for an existing conntrack entry, the packet 5-tuple >> must be inverted if NAT has already been applied, as the current >> packet headers do not match any conntrack tuple. For >> example, if a packet from private address X to a public address B is >> source-NATted to A, the conntrack entry will have the following tuples >> (ignoring the protocol and port numbers) after the conntrack entry is >> committed: >> >> Original direction tuple: (X,B) >> Reply direction tuple: (B,A) >> >> Now, if a reply packet is already transformed back to the private >> address space (e.g., with a CT(nat) action), the tuple corresponding >> to the current packet headers is: >> >> Current packet tuple: (B,X) >> >> This does not match either of the conntrack tuples above. Normally >> this does not matter, as the conntrack lookup was already done using >> the tuple (B,A), but if the current packet does not match any flow in >> the OVS datapath, the packet is sent to userspace via an upcall, >> during which the packet's skb is freed, and the conntrack entry >> pointer in the skb is lost. When the packet is reintroduced to the >> datapath, any further conntrack action will need to perform a new >> conntrack lookup to find the entry again. Prior to this patch this >> second lookup failed for NATted packets. The datapath flow setup >> corresponding to the upcall can succeed, however, allowing all further >> packets in the reply direction to re-use the conntrack entry pointer >> in the skb, so typically the lookup failure only causes a packet drop. >> >> The solution is to invert the tuple derived from the current packet >> headers in case the conntrack state stored in the packet metadata >> indicates that the packet has been transformed by NAT: >> >> Inverted tuple: (X,B) >> >> With this the conntrack entry can be found, matching the original >> direction tuple. >> >> This same logic also works for the original direction packets: >> >> Current packet tuple (after NAT): (A,B) >> Inverted tuple: (B,A) >> >> While the current packet tuple (A,B) does not match either of the >> conntrack tuples, the inverted one (B,A) does match the reply >> direction tuple. >> >> Since the inverted tuple matches the reverse direction tuple the >> direction of the packet must be reversed as well. >> >> Fixes: 05752523e565 ("openvswitch: Interface with NAT.") >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > I could not apply this patch series to net-next branch. But it does > applies to net, which branch are you targeting it for? The patches were against net-next, but there likely was a merge from netfilter around the time of me sending the email out causing the difficulty. Will address all comments, rebase and post a v2 later today. Jarno
Re: [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key.
> On Feb 6, 2017, at 11:15 PM, Joe Stringer <j...@ovn.org> wrote: > > On 2 February 2017 at 17:10, Jarno Rajahalme <ja...@ovn.org> wrote: >> struct sw_flow_key has two 16-bit holes. Move the most matched >> conntrack match fields there. In some typical cases this reduces the >> size of the key that needs to be hashed into half and into one cache >> line. >> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > Looks like this misses the zeroing in ovs_nla_get_flow_metadata(); > might want to double-check for any other memset/copies of the key->ct > field. Good catch. Looked, there are no other places to change. Will rebase to current net-next and repost. Jarno
Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
> On Feb 7, 2017, at 9:14 AM, Pravin Shelar <pshe...@ovn.org> wrote: > > On Mon, Feb 6, 2017 at 9:15 AM, David Miller <da...@davemloft.net> wrote: >> From: Pravin Shelar <pshe...@ovn.org> >> Date: Mon, 6 Feb 2017 09:06:29 -0800 >> >>> On Sun, Feb 5, 2017 at 2:28 PM, David Miller <da...@davemloft.net> wrote: >>>> From: Jarno Rajahalme <ja...@ovn.org> >>>> Date: Thu, 2 Feb 2017 17:10:00 -0800 >>>> >>>>> This does not match either of the conntrack tuples above. Normally >>>>> this does not matter, as the conntrack lookup was already done using >>>>> the tuple (B,A), but if the current packet does not match any flow in >>>>> the OVS datapath, the packet is sent to userspace via an upcall, >>>>> during which the packet's skb is freed, and the conntrack entry >>>>> pointer in the skb is lost. >>>> >>>> This is the real bug. >>>> >>>> If the metadata for a packet is important, as it obviously is here, >>>> then upcalls should preserve that metadata across the upcall. This >>>> is exactly how NF_QUEUE handles this problem and so should OVS. >>> >>> OVS kernel datapath serializes skb context and sends it along with skb >>> during upcall via genetlink parameters. userspace echoes same >>> information back to kernel modules. This way OVS does maintains >>> metadata across upcall. >> >> Then the conntrack state looked up before the NAT operation done in >> the upcall should be maintained and therefore this bug should not >> exist. > We already maintain enough metadata across the userspace upcall, but so far we have failed to use it correctly in the case of NATted packets. The NAT flags are there, based on which we (now) know that we have to do inverted lookup to locate the conntrack entry. Conntrack NAT by design only stores tuples for the incoming packets, and there are various instances of tuple inversions in netfilter code for this reason (see, for example, net/netfilter/nf_nat_core.c:nf_nat_used_tuple()). I’ll make the commit message clearer about this for the v2 of the series. > I think it fails because after upcall OVS is performing lookup for > nonexistent conntrack entry. This is due to the fact that the packet > is already Nat-ed. So one could argue that there is already enough > context available in OVS to lookup original conntract entry after the > upcall as shown in this patch. > But I am also fine with using original context to lookup the conntract > entry as Joe has suggested. For expected (i.e., related) connection, using the master’s original direction 5-tuple added to the flow meta-data in patch 5 would find the master connection, not the related connection, so it would not work. Jarno
[PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection.
Avoid triggering change events for setting conntrack mark or labels before the conntrack entry has been confirmed. Refactoring on this patch also makes chenges in later patches easier to review. Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark") Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 87 - 1 file changed, 63 insertions(+), 24 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 6730f09..a163c44 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -229,23 +229,17 @@ int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb) return 0; } -static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, +static int ovs_ct_set_mark(struct nf_conn *ct, struct sw_flow_key *key, u32 ct_mark, u32 mask) { #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) - enum ip_conntrack_info ctinfo; - struct nf_conn *ct; u32 new_mark; - /* The connection could be invalid, in which case set_mark is no-op. */ - ct = nf_ct_get(skb, ); - if (!ct) - return 0; - new_mark = ct_mark | (ct->mark & ~(mask)); if (ct->mark != new_mark) { ct->mark = new_mark; - nf_conntrack_event_cache(IPCT_MARK, ct); + if (nf_ct_is_confirmed(ct)) + nf_conntrack_event_cache(IPCT_MARK, ct); key->ct.mark = new_mark; } @@ -255,26 +249,59 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, #endif } -static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, -const struct ovs_key_ct_labels *labels, -const struct ovs_key_ct_labels *mask) +static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct) { - enum ip_conntrack_info ctinfo; struct nf_conn_labels *cl; - struct nf_conn *ct; - int err; - - /* The connection could be invalid, in which case set_label is no-op.*/ - ct = nf_ct_get(skb, ); - if (!ct) - return 0; cl = nf_ct_labels_find(ct); if (!cl) { nf_ct_labels_ext_add(ct); cl = nf_ct_labels_find(ct); } + if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) + return NULL; + + return cl; +} + +/* Initialize labels for a new, to be committed conntrack entry. Note that + * since the new connection is not yet confirmed, and thus no-one else has + * access to it's labels, we simply write them over. Also, we refrain from + * triggering events, as receiving change events before the create event would + * be confusing. + */ +static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, + const struct ovs_key_ct_labels *labels, + const struct ovs_key_ct_labels *mask) +{ + struct nf_conn_labels *cl; + u32 *dst; + int i; + + cl = ovs_ct_get_conn_labels(ct); + if (!cl) + return -ENOSPC; + + dst = (u32 *)cl->bits; + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] & mask->ct_labels_32[i]); + + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); + + return 0; +} + +static int ovs_ct_set_labels(struct nf_conn *ct, struct sw_flow_key *key, +const struct ovs_key_ct_labels *labels, +const struct ovs_key_ct_labels *mask) +{ + struct nf_conn_labels *cl; + int err; + + cl = ovs_ct_get_conn_labels(ct); + if (!cl) return -ENOSPC; err = nf_connlabels_replace(ct, labels->ct_labels_32, @@ -283,7 +310,8 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, if (err) return err; - ovs_ct_get_labels(ct, >ct.labels); + memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); + return 0; } @@ -865,25 +893,36 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, const struct ovs_conntrack_info *info, struct sk_buff *skb) { + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; int err; err = __ovs_ct_lookup(net, key, info, skb); if (err) return err; + /* The connection could be invalid, in which case this is a no-op.*/ + ct = nf_ct_get(skb, ); + if (!ct) + return 0; + /* Apply changes before confirming the connection so that the initial
[PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key.
Add the fields of the conntrack original direction 5-tuple to struct sw_flow_key. The new fields are initially zeroed, and are populated whenever a conntrack action is executed and either finds or generates a conntrack entry. This means that these fields exist for all packets were not rejected by conntrack as untrackable. The original tuple fields in the sw_flow_key are filled from the original direction tuple of the conntrack entry relating to the current packet, or from the original direction tuple of the master conntrack entry, if the current conntrack entry has a master. Generally, expected connections of connections having an assigned helper (e.g., FTP), have a master conntrack entry. The main purpose of the new conntrack original tuple fields is to allow matching on them for policy decision purposes, with the premise that the admissibility of tracked connections reply packets (as well as original direction packets), and both direction packets of any related connections may be based on ACL rules applying to the master connection's original direction 5-tuple. This also makes it easier to make policy decisions when the actual packet headers might have been transformed by NAT, as the original direction 5-tuple represents the packet headers before any such transformation. When using the original direction 5-tuple the admissibility of return and/or related packets need not be based on the mere existence of a conntrack entry, allowing separation of admission policy from the established conntrack state. While existence of a conntrack entry is required for admission of the return or related packets, policy changes can render connections that were initially admitted to be rejected or dropped afterwards. If the admission of the return and related packets was based on mere conntrack state (e.g., connection being in an established state), a policy change that would make the connection rejected or dropped would need to find and delete all conntrack entries affected by such a change. When using the original direction 5-tuple matching the affected conntrack entries can be allowed to time out instead, as the established state of the connection would not need to be the basis for packet admission any more. It should be noted that the directionality of related connections may be the same or different than that of the master connection, and neither the original direction 5-tuple nor the conntrack state bits carry this information. If needed, the directionality of the master connection can be stored in master's conntrack mark or labels, which are automatically inherited by the expected related connections. The fact that neither ARP not ND packets are trackable by conntrack allows mutual exclusion between ARP/ND and the new conntrack original tuple fields. Hence, the IP addresses are overlaid in union with ARP and ND fields. This allows the sw_flow_key to not grow much due to this patch, but it also means that we must be careful to never use the new key fields with ARP or ND packets. ARP is easy to distinguish and keep mutually exclusive based on the ethernet type, but ND being an ICMPv6 protocol requires a bit more attention. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 20 - net/openvswitch/actions.c| 2 + net/openvswitch/conntrack.c | 95 +--- net/openvswitch/conntrack.h | 13 +- net/openvswitch/flow.c | 34 +++--- net/openvswitch/flow.h | 49 - net/openvswitch/flow_netlink.c | 91 +- net/openvswitch/flow_netlink.h | 7 ++- 8 files changed, 264 insertions(+), 47 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 96aee34..90af8b8 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -1,6 +1,6 @@ /* - * Copyright (c) 2007-2013 Nicira, Inc. + * Copyright (c) 2007-2017 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of version 2 of the GNU General Public @@ -331,6 +331,8 @@ enum ovs_key_attr { OVS_KEY_ATTR_CT_ZONE, /* u16 connection tracking zone. */ OVS_KEY_ATTR_CT_MARK, /* u32 connection tracking mark */ OVS_KEY_ATTR_CT_LABELS, /* 16-octet connection tracking label */ + OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, /* struct ovs_key_ct_tuple_ipv4 */ + OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, /* struct ovs_key_ct_tuple_ipv6 */ #ifdef __KERNEL__ OVS_KEY_ATTR_TUNNEL_INFO, /* struct ip_tunnel_info */ @@ -472,6 +474,22 @@ struct ovs_key_ct_labels { #define OVS_CS_F_NAT_MASK (OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT) +struct ovs_key_ct_tuple_ipv4 { + __be32 ipv4_src; + __be32 ipv4_dst; + __be16 src_port; + __be16 dst_port; + __u8 ipv4_proto; +}; + +struct ovs_key_ct_tuple_ipv6 { + __be32 ip
[PATCH net-next 2/7] openvswitch: Unionize ovs_key_ct_label with a u32 array.
Make the array of labels in struct ovs_key_ct_label an union, adding a u32 array of the same byte size as the existing u8 array. It is faster to loop through the labels 32 bits at the time, which is also the alignment of netlink attributes. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 8 ++-- net/openvswitch/conntrack.c | 9 + 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 375d812..96aee34 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -446,9 +446,13 @@ struct ovs_key_nd { __u8nd_tll[ETH_ALEN]; }; -#define OVS_CT_LABELS_LEN 16 +#define OVS_CT_LABELS_LEN_32 4 +#define OVS_CT_LABELS_LEN (OVS_CT_LABELS_LEN_32 * sizeof(__u32)) struct ovs_key_ct_labels { - __u8ct_labels[OVS_CT_LABELS_LEN]; + union { + __u8ct_labels[OVS_CT_LABELS_LEN]; + __u32 ct_labels_32[OVS_CT_LABELS_LEN_32]; + }; }; /* OVS_KEY_ATTR_CT_STATE flags */ diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index b91baa2..6730f09 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -277,8 +277,9 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN) return -ENOSPC; - err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask, - OVS_CT_LABELS_LEN / sizeof(u32)); + err = nf_connlabels_replace(ct, labels->ct_labels_32, + mask->ct_labels_32, + OVS_CT_LABELS_LEN_32); if (err) return err; @@ -852,8 +853,8 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels) { size_t i; - for (i = 0; i < sizeof(*labels); i++) - if (labels->ct_labels[i]) + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + if (labels->ct_labels_32[i]) return true; return false; -- 2.1.4
[PATCH net-next 6/7] openvswitch: Add force commit.
Stateful network admission policy may allow connections to one direction and reject connections initiated in the other direction. After policy change it is possible that for a new connection an overlapping conntrack entry already exist, where the connection original direction is opposed to the new connection's initial packet. Most importantly, conntrack state relating to the current packet gets the "reply" designation based on whether the original direction tuple or the reply direction tuple matched. If this "directionality" is wrong w.r.t. to the stateful network admission policy it may happen that packets in neither direction are correctly admitted. This patch adds a new "force commit" option to the OVS conntrack action that checks the original direction of an existing conntrack entry. If that direction is opposed to the current packet, the existing conntrack entry is deleted and a new one is subsequently created in the correct direction. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/uapi/linux/openvswitch.h | 10 ++ net/openvswitch/conntrack.c | 27 +-- 2 files changed, 35 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 90af8b8..d5ba9a9 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -674,6 +674,10 @@ struct ovs_action_hash { * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG. * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address * translation (NAT) on the packet. + * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of doing + * nothing if the connection is already committed will check that the current + * packet is in conntrack entry's original direction. If directionality does + * not match, will delete the existing conntrack entry and commit a new one. */ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -684,6 +688,12 @@ enum ovs_ct_attr { OVS_CT_ATTR_HELPER, /* netlink helper to assist detection of related connections. */ OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ + OVS_CT_ATTR_FORCE_COMMIT, /* No argument, commits connection. If the + * conntrack entry original direction tuple + * does not match the current packet header + * values, will delete the current conntrack + * entry and create a new one. + */ __OVS_CT_ATTR_MAX }; diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 1afe153..1f27f44 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -65,6 +65,7 @@ struct ovs_conntrack_info { struct nf_conn *ct; u8 commit : 1; u8 nat : 3; /* enum ovs_ct_nat */ + u8 force : 1; u16 family; struct md_mark mark; struct md_labels labels; @@ -631,10 +632,13 @@ static bool skb_nfct_cached(struct net *net, */ if (!ct && key->ct.state & OVS_CS_F_TRACKED && !(key->ct.state & OVS_CS_F_INVALID) && - key->ct.zone == info->zone.id) + key->ct.zone == info->zone.id) { ct = ovs_ct_find_existing(net, >zone, info->family, skb, !!(key->ct.state & OVS_CS_F_NAT_MASK)); + if (ct) + nf_ct_get(skb, ); + } if (!ct) return false; if (!net_eq(net, read_pnet(>ct_net))) @@ -648,6 +652,19 @@ static bool skb_nfct_cached(struct net *net, if (help && rcu_access_pointer(help->helper) != info->helper) return false; } + /* Force conntrack entry direction to the current packet? */ + if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) { + /* Delete the conntrack entry if confirmed, else just release +* the reference. +*/ + if (nf_ct_is_confirmed(ct)) + nf_ct_delete(ct, 0, 0); + else + nf_ct_put(ct); + skb->nfct = NULL; + skb->nfctinfo = 0; + return false; + } return true; } @@ -1226,6 +1243,7 @@ static int parse_nat(const struct nlattr *attr, static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = { [OVS_CT_ATTR_COMMIT]= { .minlen = 0, .maxlen = 0 }, + [OVS_CT_ATTR_FORCE_COMMIT] = { .minlen = 0, .maxlen = 0 }, [OVS_CT_ATTR_ZONE] = { .minlen = sizeof(u16), .ma
[PATCH net-next 4/7] openvswitch: Inherit master's labels.
We avoid calling into nf_conntrack_in() for expected connections, as that would remove the expectation that we want to stick around until we are ready to commit the connection. Instead, we do a lookup in the expectation table directly. However, after a successful expectation lookup we have set the flow key label field from the master connection, whereas nf_conntrack_in() does not do this. This leads to master's labels being iherited after an expectation lookup, but those labels not being inherited after the corresponding conntrack action with a commit flag. This patch resolves the problem by changing the commit code path to also inherit the master's labels to the expected connection. Resolving this conflict in favor or inheriting the labels allows information be passed from the master connection to related connections, which would otherwise be much harder. Labels can still be set explicitly, so this change only affects the default values of the labels in presense of a master connection. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 48 - 1 file changed, 34 insertions(+), 14 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index a163c44..738a4fa 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -265,6 +265,8 @@ static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct) return cl; } +static bool labels_nonzero(const struct ovs_key_ct_labels *labels); + /* Initialize labels for a new, to be committed conntrack entry. Note that * since the new connection is not yet confirmed, and thus no-one else has * access to it's labels, we simply write them over. Also, we refrain from @@ -275,18 +277,35 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key, const struct ovs_key_ct_labels *labels, const struct ovs_key_ct_labels *mask) { - struct nf_conn_labels *cl; - u32 *dst; - int i; + struct nf_conn_labels *cl, *master_cl; + bool have_mask = labels_nonzero(mask); + + /* Inherit master's labels to the related connection? */ + master_cl = (ct->master) ? nf_ct_labels_find(ct->master) : NULL; + + if (!master_cl && !have_mask) + return 0; /* Nothing to do. */ cl = ovs_ct_get_conn_labels(ct); if (!cl) return -ENOSPC; - dst = (u32 *)cl->bits; - for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) - dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | - (labels->ct_labels_32[i] & mask->ct_labels_32[i]); + /* Inherit the master's labels, if any. */ + if (master_cl) { + size_t len = sizeof(master_cl->bits); + + memcpy(>bits, _cl->bits, + len > OVS_CT_LABELS_LEN ? OVS_CT_LABELS_LEN : len); + } + if (have_mask) { + u32 *dst = (u32 *)cl->bits; + int i; + + for (i = 0; i < OVS_CT_LABELS_LEN_32; i++) + dst[i] = (dst[i] & ~mask->ct_labels_32[i]) | + (labels->ct_labels_32[i] +& mask->ct_labels_32[i]); + } memcpy(>ct.labels, cl->bits, OVS_CT_LABELS_LEN); @@ -916,13 +935,14 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, if (err) return err; } - if (labels_nonzero(>labels.mask)) { - if (!nf_ct_is_confirmed(ct)) - err = ovs_ct_init_labels(ct, key, >labels.value, ->labels.mask); - else - err = ovs_ct_set_labels(ct, key, >labels.value, - >labels.mask); + if (!nf_ct_is_confirmed(ct)) { + err = ovs_ct_init_labels(ct, key, >labels.value, +>labels.mask); + if (err) + return err; + } else if (labels_nonzero(>labels.mask)) { + err = ovs_ct_set_labels(ct, key, >labels.value, + >labels.mask); if (err) return err; } -- 2.1.4
[PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
When looking for an existing conntrack entry, the packet 5-tuple must be inverted if NAT has already been applied, as the current packet headers do not match any conntrack tuple. For example, if a packet from private address X to a public address B is source-NATted to A, the conntrack entry will have the following tuples (ignoring the protocol and port numbers) after the conntrack entry is committed: Original direction tuple: (X,B) Reply direction tuple: (B,A) Now, if a reply packet is already transformed back to the private address space (e.g., with a CT(nat) action), the tuple corresponding to the current packet headers is: Current packet tuple: (B,X) This does not match either of the conntrack tuples above. Normally this does not matter, as the conntrack lookup was already done using the tuple (B,A), but if the current packet does not match any flow in the OVS datapath, the packet is sent to userspace via an upcall, during which the packet's skb is freed, and the conntrack entry pointer in the skb is lost. When the packet is reintroduced to the datapath, any further conntrack action will need to perform a new conntrack lookup to find the entry again. Prior to this patch this second lookup failed for NATted packets. The datapath flow setup corresponding to the upcall can succeed, however, allowing all further packets in the reply direction to re-use the conntrack entry pointer in the skb, so typically the lookup failure only causes a packet drop. The solution is to invert the tuple derived from the current packet headers in case the conntrack state stored in the packet metadata indicates that the packet has been transformed by NAT: Inverted tuple: (X,B) With this the conntrack entry can be found, matching the original direction tuple. This same logic also works for the original direction packets: Current packet tuple (after NAT): (A,B) Inverted tuple: (B,A) While the current packet tuple (A,B) does not match either of the conntrack tuples, the inverted one (B,A) does match the reply direction tuple. Since the inverted tuple matches the reverse direction tuple the direction of the packet must be reversed as well. Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 54253ea..b91baa2 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -430,7 +430,7 @@ ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h) */ static struct nf_conn * ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, -u8 l3num, struct sk_buff *skb) +u8 l3num, struct sk_buff *skb, bool natted) { struct nf_conntrack_l3proto *l3proto; struct nf_conntrack_l4proto *l4proto; @@ -453,6 +453,17 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, return NULL; } + /* Must invert the tuple if skb has been transformed by NAT. */ + if (natted) { + struct nf_conntrack_tuple inverse; + + if (!nf_ct_invert_tuple(, , l3proto, l4proto)) { + pr_debug("ovs_ct_find_existing: Inversion failed!\n"); + return NULL; + } + tuple = inverse; + } + /* look for tuple match */ h = nf_conntrack_find_get(net, zone, ); if (!h) @@ -460,6 +471,13 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, ct = nf_ct_tuplehash_to_ctrack(h); + /* Inverted packet tuple matches the reverse direction conntrack tuple, +* select the other tuplehash to get the right 'ctinfo' bits for this +* packet. +*/ + if (natted) + h = >tuplehash[!h->tuple.dst.dir]; + skb->nfct = >ct_general; skb->nfctinfo = ovs_ct_get_info(h); return ct; @@ -483,7 +501,9 @@ static bool skb_nfct_cached(struct net *net, if (!ct && key->ct.state & OVS_CS_F_TRACKED && !(key->ct.state & OVS_CS_F_INVALID) && key->ct.zone == info->zone.id) - ct = ovs_ct_find_existing(net, >zone, info->family, skb); + ct = ovs_ct_find_existing(net, >zone, info->family, skb, + !!(key->ct.state +& OVS_CS_F_NAT_MASK)); if (!ct) return false; if (!net_eq(net, read_pnet(>ct_net))) -- 2.1.4
[PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key.
struct sw_flow_key has two 16-bit holes. Move the most matched conntrack match fields there. In some typical cases this reduces the size of the key that needs to be hashed into half and into one cache line. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c| 42 +- net/openvswitch/conntrack.h| 6 +++--- net/openvswitch/flow.h | 14 -- net/openvswitch/flow_netlink.c | 8 4 files changed, 36 insertions(+), 34 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 1f27f44..e8d29c0 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -152,8 +152,8 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, const struct nf_conntrack_zone *zone, const struct nf_conn *ct) { - key->ct.state = state; - key->ct.zone = zone->id; + key->ct_state = state; + key->ct_zone = zone->id; key->ct.mark = ovs_ct_get_mark(ct); ovs_ct_get_labels(ct, >ct.labels); @@ -161,7 +161,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, if (ct && ct->master) ct = ct->master; - key->ct.orig_proto = 0; + key->ct_orig_proto = 0; key->ct.orig_tp.src = 0; key->ct.orig_tp.dst = 0; if (key->eth.type == htons(ETH_P_IP)) { @@ -172,7 +172,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, key->ipv4.ct_orig.src = orig->src.u3.ip; key->ipv4.ct_orig.dst = orig->dst.u3.ip; - key->ct.orig_proto = orig->dst.protonum; + key->ct_orig_proto = orig->dst.protonum; if (orig->dst.protonum == IPPROTO_ICMP) { key->ct.orig_tp.src = htons(orig->dst.u.icmp.type); @@ -195,7 +195,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state, key->ipv6.ct_orig.src = orig->src.u3.in6; key->ipv6.ct_orig.dst = orig->dst.u3.in6; - key->ct.orig_proto = orig->dst.protonum; + key->ct_orig_proto = orig->dst.protonum; if (orig->dst.protonum == IPPROTO_ICMPV6) { key->ct.orig_tp.src = htons(orig->dst.u.icmp.type); @@ -238,7 +238,7 @@ static void ovs_ct_update_key(const struct sk_buff *skb, if (ct->master) state |= OVS_CS_F_RELATED; if (keep_nat_flags) { - state |= key->ct.state & OVS_CS_F_NAT_MASK; + state |= key->ct_state & OVS_CS_F_NAT_MASK; } else { if (ct->status & IPS_SRC_NAT) state |= OVS_CS_F_SRC_NAT; @@ -269,11 +269,11 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key) int ovs_ct_put_key(const struct sw_flow_key *swkey, const struct sw_flow_key *output, struct sk_buff *skb) { - if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct.state)) + if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct_state)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) && - nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct.zone)) + nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct_zone)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) && @@ -285,25 +285,25 @@ int ovs_ct_put_key(const struct sw_flow_key *swkey, >ct.labels)) return -EMSGSIZE; - if (swkey->eth.type == htons(ETH_P_IP) && swkey->ct.orig_proto) { + if (swkey->eth.type == htons(ETH_P_IP) && swkey->ct_orig_proto) { struct ovs_key_ct_tuple_ipv4 orig = { output->ipv4.ct_orig.src, output->ipv4.ct_orig.dst, output->ct.orig_tp.src, output->ct.orig_tp.dst, - output->ct.orig_proto, + output->ct_orig_proto, }; if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, sizeof(orig), )) return -EMSGSIZE; } else if (swkey->eth.type == htons(ETH_P_IPV6) && - swkey->ct.orig_proto) { + swkey->ct_orig_proto) { struct ovs_key_ct_tuple_ipv6 orig = { IN6
Re: [net-next] openvswitch: Simplify do_execute_actions().
Nice clean-up. Acked-by: Jarno Rajahalme <ja...@ovn.org> > On Jan 25, 2017, at 9:24 PM, Andy Zhou <az...@ovn.org> wrote: > > do_execute_actions() implements a worthwhile optimization: in case > an output action is the last action in an action list, skb_clone() > can be avoided by outputing the current skb. However, the > implementation is more complicated than necessary. This patch > simplify this logic. > > Signed-off-by: Andy Zhou <az...@ovn.org> > --- > net/openvswitch/actions.c | 40 +++- > 1 file changed, 19 insertions(+), 21 deletions(-) > > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c > index 514f7bc..3866608 100644 > --- a/net/openvswitch/actions.c > +++ b/net/openvswitch/actions.c > @@ -830,6 +830,9 @@ static void do_output(struct datapath *dp, struct sk_buff > *skb, int out_port, > { > struct vport *vport = ovs_vport_rcu(dp, out_port); > > + if (unlikely(!skb)) > + return; > + > if (likely(vport)) { > u16 mru = OVS_CB(skb)->mru; > u32 cutlen = OVS_CB(skb)->cutlen; > @@ -1141,12 +1144,6 @@ static int do_execute_actions(struct datapath *dp, > struct sk_buff *skb, > struct sw_flow_key *key, > const struct nlattr *attr, int len) > { > - /* Every output action needs a separate clone of 'skb', but the common > - * case is just a single output action, so that doing a clone and > - * then freeing the original skbuff is wasteful. So the following code > - * is slightly obscure just to avoid that. > - */ > - int prev_port = -1; > const struct nlattr *a; > int rem; > > @@ -1154,20 +1151,25 @@ static int do_execute_actions(struct datapath *dp, > struct sk_buff *skb, >a = nla_next(a, )) { > int err = 0; > > - if (unlikely(prev_port != -1)) { > - struct sk_buff *out_skb = skb_clone(skb, GFP_ATOMIC); > + switch (nla_type(a)) { > + case OVS_ACTION_ATTR_OUTPUT: { > + int port = nla_get_u32(a); > > - if (out_skb) > - do_output(dp, out_skb, prev_port, key); > + /* Every output action needs a separate clone > + * of 'skb', In case the output action is the > + * last action, cloning can be avoided. > + */ > + if (nla_is_last(a, rem)) { > + do_output(dp, skb, port, key); > + /* 'skb' has been used for output. > + */ > + return 0; > + } > > + do_output(dp, skb_clone(skb, GFP_ATOMIC), port, key); > OVS_CB(skb)->cutlen = 0; > - prev_port = -1; > - } > - > - switch (nla_type(a)) { > - case OVS_ACTION_ATTR_OUTPUT: > - prev_port = nla_get_u32(a); > break; > + } > > case OVS_ACTION_ATTR_TRUNC: { > struct ovs_action_trunc *trunc = nla_data(a); > @@ -1257,11 +1259,7 @@ static int do_execute_actions(struct datapath *dp, > struct sk_buff *skb, > } > } > > - if (prev_port != -1) > - do_output(dp, skb, prev_port, key); > - else > - consume_skb(skb); > - > + consume_skb(skb); > return 0; > } > > -- > 1.8.3.1 >
Re: [PATCH net] openvswitch: Add a missing break statement.
> On Dec 19, 2016, at 5:06 PM, Jarno Rajahalme <ja...@ovn.org> wrote: > > Add a break statement to prevent fall-through from > OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL. Without the break > actions setting ethernet addresses fail to validate with log messages > complaining about invalid tunnel attributes. > > Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets") > Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > Acked-by: Pravin B Shelar <pshe...@ovn.org> > Acked-by: Jiri Benc <jb...@redhat.com> > --- > net/openvswitch/flow_netlink.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c > index d19044f..c87d359 100644 > --- a/net/openvswitch/flow_netlink.c > +++ b/net/openvswitch/flow_netlink.c > @@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a, > case OVS_KEY_ATTR_ETHERNET: > if (mac_proto != MAC_PROTO_ETHERNET) > return -EINVAL; > + break; > > case OVS_KEY_ATTR_TUNNEL: > if (masked) > -- > 2.1.4 > I posted this separately from an earlier net-next patch series. Pravin agreed to pick up that rest of the series (skb->protocol w/ vlans). Jarno
Re: [PATCH v3 net-next 1/3] openvswitch: Add a missing break statement.
> On Dec 13, 2016, at 9:07 PM, Pravin Shelar <pshe...@ovn.org> wrote: > > On Tue, Nov 29, 2016 at 3:30 PM, Jarno Rajahalme <ja...@ovn.org> wrote: >> Add a break statement to prevent fall-through from >> OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL. Without the break >> actions setting ethernet addresses fail to validate with log messages >> complaining about invalid tunnel attributes. >> >> Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets") >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> >> Acked-by: Pravin B Shelar <pshe...@ovn.org> >> Acked-by: Jiri Benc <jb...@redhat.com> > > Hi Jarno, > Since this is straight forward patch. can you send it separately so > that we can get it merged soon? > I just did, against net. You’ll take over the rest? Jarno > Thanks, > Pravin.
[PATCH net] openvswitch: Add a missing break statement.
Add a break statement to prevent fall-through from OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL. Without the break actions setting ethernet addresses fail to validate with log messages complaining about invalid tunnel attributes. Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> Acked-by: Jiri Benc <jb...@redhat.com> --- net/openvswitch/flow_netlink.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index d19044f..c87d359 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a, case OVS_KEY_ATTR_ETHERNET: if (mac_proto != MAC_PROTO_ETHERNET) return -EINVAL; + break; case OVS_KEY_ATTR_TUNNEL: if (masked) -- 2.1.4
Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
> On Nov 30, 2016, at 5:51 AM, Jiri Benc <jb...@redhat.com> wrote: > > On Tue, 29 Nov 2016 15:30:52 -0800, Jarno Rajahalme wrote: >> @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct >> sk_buff *skb, u8 mac_proto) >> goto drop; >> } >> >> -if (unlikely(packet_length(skb, vport->dev) > mtu && >> - !skb_is_gso(skb))) { >> -net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n", >> - vport->dev->name, >> - packet_length(skb, vport->dev), mtu); >> +if (unlikely(!is_skb_forwardable(vport->dev, skb))) { > > How does this work when the vlan tag is accelerated? Then we can be > over MTU, yet the check will pass. > I’ll check how other call sites use this. Jarno > Jiri
Re: [PATCH v2 net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.
> On Nov 28, 2016, at 11:21 PM, Pravin Shelar <pshe...@ovn.org> wrote: > > On Mon, Nov 28, 2016 at 6:41 PM, Jarno Rajahalme <ja...@ovn.org> wrote: >> Do not set skb->protocol to be the ethertype of the L3 header, unless >> the packet only has the L3 header. For a non-hardware offloaded VLAN >> frame skb->protocol needs to be one of the VLAN ethertypes. >> >> Any VLAN offloading is undone on the OVS netlink interface. Also any >> VLAN tags added by userspace are non-offloaded. >> >> Incorrect skb->protocol value on a full-size non-offloaded VLAN skb >> causes packet drop due to failing MTU check, as the VLAN header should >> not be counted in when considering MTU in ovs_vport_send(). >> > I think we should move to is_skb_forwardable() type of packet length > check in vport-send and get rid of skb-protocol checks altogether. > I added a new patch to do this, thanks for the suggestion. >> Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets") >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> >> --- >> v2: Set skb->protocol when an ETH_P_TEB frame is received via ARPHRD_NONE >>interface. >> >> net/openvswitch/datapath.c | 1 - >> net/openvswitch/flow.c | 30 ++ >> 2 files changed, 22 insertions(+), 9 deletions(-) > ... > ... >> @@ -531,15 +538,22 @@ static int key_extract(struct sk_buff *skb, struct >> sw_flow_key *key) >>if (unlikely(parse_vlan(skb, key))) >>return -ENOMEM; >> >> - skb->protocol = parse_ethertype(skb); >> - if (unlikely(skb->protocol == htons(0))) >> + key->eth.type = parse_ethertype(skb); >> + if (unlikely(key->eth.type == htons(0))) >>return -ENOMEM; >> >> + if (skb->protocol == htons(ETH_P_TEB)) { >> + if (key->eth.vlan.tci & htons(VLAN_TAG_PRESENT) >> + && !skb_vlan_tag_present(skb)) >> + skb->protocol = key->eth.vlan.tpid; >> + else >> + skb->protocol = key->eth.type; >> + } >> + > > I am not sure if this work in case of nested vlans. > Can we move skb-protocol assignment to parse_vlan() to avoid checking > for non-accelerated vlan case again here? I did this for the v3 that I sent out a moment ago. Thanks for the review, Jarno
[PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
Use is_skb_forwardable() instead of an explicit length check. This gets around the apparent MTU check failure in OVS test cases when skb->protocol is not properly set in case of non-accelerated VLAN skbs. Suggested-by: Pravin Shelar <pshe...@ovn.org> Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- v3: New patch suggested by Pravin. net/openvswitch/vport.c | 38 ++ 1 file changed, 14 insertions(+), 24 deletions(-) diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index b6c8524..076b39f 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -464,27 +464,8 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff *skb, return 0; } -static unsigned int packet_length(const struct sk_buff *skb, - struct net_device *dev) -{ - unsigned int length = skb->len - dev->hard_header_len; - - if (!skb_vlan_tag_present(skb) && - eth_type_vlan(skb->protocol)) - length -= VLAN_HLEN; - - /* Don't subtract for multiple VLAN tags. Most (all?) drivers allow -* (ETH_LEN + VLAN_HLEN) in addition to the mtu value, but almost none -* account for 802.1ad. e.g. is_skb_forwardable(). -*/ - - return length; -} - void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto) { - int mtu = vport->dev->mtu; - switch (vport->dev->type) { case ARPHRD_NONE: if (mac_proto == MAC_PROTO_ETHERNET) { @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto) goto drop; } - if (unlikely(packet_length(skb, vport->dev) > mtu && -!skb_is_gso(skb))) { - net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n", -vport->dev->name, -packet_length(skb, vport->dev), mtu); + if (unlikely(!is_skb_forwardable(vport->dev, skb))) { + /* Log only if the device is up. */ + if (vport->dev->flags & IFF_UP) { + unsigned int length = skb->len + - vport->dev->hard_header_len; + + if (!skb_vlan_tag_present(skb) + && eth_type_vlan(skb->protocol)) + length -= VLAN_HLEN; + + net_warn_ratelimited("%s: dropped over-mtu packet %d > %d\n", +vport->dev->name, length, +vport->dev->mtu); + } vport->dev->stats.tx_errors++; goto drop; } -- 2.1.4
[PATCH v3 net-next 1/3] openvswitch: Add a missing break statement.
Add a break statement to prevent fall-through from OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL. Without the break actions setting ethernet addresses fail to validate with log messages complaining about invalid tunnel attributes. Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> Acked-by: Jiri Benc <jb...@redhat.com> --- v3: No change. net/openvswitch/flow_netlink.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index d19044f..c87d359 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a, case OVS_KEY_ATTR_ETHERNET: if (mac_proto != MAC_PROTO_ETHERNET) return -EINVAL; + break; case OVS_KEY_ATTR_TUNNEL: if (masked) -- 2.1.4
[PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
Do not always set skb->protocol to be the ethertype of the L3 header. For a packet with non-accelerated VLAN tags skb->protocol needs to be the ethertype of the outermost non-accelerated VLAN ethertype. Any VLAN offloading is undone on the OVS netlink interface, and any VLAN tags added by userspace are non-accelerated, as are double tagged VLAN packets. Fixes: 018c1dda5f ("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes") Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- v3: Set skb->protocol properly also for double tagged packets, as suggested by Pravin. This patch is no longer needed for the MTU test failure, as the new patch 2/3 addresses that. net/openvswitch/datapath.c | 1 - net/openvswitch/flow.c | 62 +++--- 2 files changed, 31 insertions(+), 32 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 2d4c4d3..9c62b63 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -606,7 +606,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) rcu_assign_pointer(flow->sf_acts, acts); packet->priority = flow->key.phy.priority; packet->mark = flow->key.phy.skb_mark; - packet->protocol = flow->key.eth.type; rcu_read_lock(); dp = get_dp_rcu(net, ovs_header->dp_ifindex); diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 08aa926..e2fe2e5 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -354,6 +354,7 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key) res = parse_vlan_tag(skb, >eth.vlan); if (res <= 0) return res; + skb->protocol = key->eth.vlan.tpid; } /* Parse inner vlan tag. */ @@ -361,6 +362,11 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key) if (res <= 0) return res; + /* If the outer vlan tag was accelerated, skb->protocol should +* refelect the inner vlan type. */ + if (!eth_type_vlan(skb->protocol)) + skb->protocol = key->eth.cvlan.tpid; + return 0; } @@ -477,12 +483,18 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, } /** - * key_extract - extracts a flow key from an Ethernet frame. + * key_extract - extracts a flow key from a packet with or without an + * Ethernet header. * @skb: sk_buff that contains the frame, with skb->data pointing to the - * Ethernet header + * beginning of the packet. * @key: output flow key * - * The caller must ensure that skb->len >= ETH_HLEN. + * 'key->mac_proto' must be initialized to indicate the frame type. For an L3 + * frame 'key->mac_proto' must equal 'MAC_PROTO_NONE', and the caller must + * ensure that 'skb->protocol' is set to the ethertype of the L3 header. + * Otherwise the presence of an Ethernet header is assumed and the caller must + * ensure that skb->len >= ETH_HLEN and that 'skb->protocol' is initialized to + * zero. * * Returns 0 if successful, otherwise a negative errno value. * @@ -498,8 +510,9 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, * of a correct length, otherwise the same as skb->network_header. * For other key->eth.type values it is left untouched. * - *- skb->protocol: the type of the data starting at skb->network_header. - * Equals to key->eth.type. + *- skb->protocol: For non-accelerated VLAN, one of the VLAN ether types, + * otherwise the same as key->eth.type, the ether type of the payload + * starting at skb->network_header. */ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) { @@ -531,15 +544,16 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) if (unlikely(parse_vlan(skb, key))) return -ENOMEM; - skb->protocol = parse_ethertype(skb); - if (unlikely(skb->protocol == htons(0))) + key->eth.type = parse_ethertype(skb); + if (unlikely(key->eth.type == htons(0))) return -ENOMEM; skb_reset_network_header(skb); __skb_push(skb, skb->data - skb_mac_header(skb)); } skb_reset_mac_len(skb); - key->eth.type = skb->protocol; + if (!eth_type_vlan(skb->protocol)) + skb->protocol = key->eth.type; /* Network layer. */ if (key->eth.type == htons(ETH_P_IP)) { @@ -800,29 +814,15 @@ int ovs_flow_key_extract_userspace(struct net *net, const struct nlattr *attr, if (err) return err;
[PATCH v2 net-next 1/2] openvswitch: Add a missing break statement.
Add a break statement to prevent fall-through from OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL. Without the break actions setting ethernet addresses fail to validate with log messages complaining about invalid tunnel attributes. Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> Acked-by: Pravin B Shelar <pshe...@ovn.org> Acked-by: Jiri Benc <jb...@redhat.com> --- v2: No change. net/openvswitch/flow_netlink.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index d19044f..c87d359 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a, case OVS_KEY_ATTR_ETHERNET: if (mac_proto != MAC_PROTO_ETHERNET) return -EINVAL; + break; case OVS_KEY_ATTR_TUNNEL: if (masked) -- 2.1.4
[PATCH v2 net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.
Do not set skb->protocol to be the ethertype of the L3 header, unless the packet only has the L3 header. For a non-hardware offloaded VLAN frame skb->protocol needs to be one of the VLAN ethertypes. Any VLAN offloading is undone on the OVS netlink interface. Also any VLAN tags added by userspace are non-offloaded. Incorrect skb->protocol value on a full-size non-offloaded VLAN skb causes packet drop due to failing MTU check, as the VLAN header should not be counted in when considering MTU in ovs_vport_send(). Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- v2: Set skb->protocol when an ETH_P_TEB frame is received via ARPHRD_NONE interface. net/openvswitch/datapath.c | 1 - net/openvswitch/flow.c | 30 ++ 2 files changed, 22 insertions(+), 9 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 2d4c4d3..9c62b63 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -606,7 +606,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) rcu_assign_pointer(flow->sf_acts, acts); packet->priority = flow->key.phy.priority; packet->mark = flow->key.phy.skb_mark; - packet->protocol = flow->key.eth.type; rcu_read_lock(); dp = get_dp_rcu(net, ovs_header->dp_ifindex); diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 08aa926..b9aae99 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -477,12 +477,17 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, } /** - * key_extract - extracts a flow key from an Ethernet frame. + * key_extract - extracts a flow key from a packet with or without an + * Ethernet header. * @skb: sk_buff that contains the frame, with skb->data pointing to the - * Ethernet header + * beginning of the packet. * @key: output flow key * - * The caller must ensure that skb->len >= ETH_HLEN. + * 'key->mac_proto' must be initialized to indicate the frame type. + * For an L3 frame 'key->mac_proto' must equal 'MAC_PROTO_NONE', and the + * caller must ensure that 'skb->protocol' is set to the ethertype of the L3 + * header. Otherwise the presence of an Ethernet header is assumed and + * the caller must ensure that skb->len >= ETH_HLEN. * * Returns 0 if successful, otherwise a negative errno value. * @@ -498,8 +503,9 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, * of a correct length, otherwise the same as skb->network_header. * For other key->eth.type values it is left untouched. * - *- skb->protocol: the type of the data starting at skb->network_header. - * Equals to key->eth.type. + *- skb->protocol: For non-accelerated VLAN, one of the VLAN ether types, + * otherwise the same as key->eth.type, the ether type of the payload + * starting at skb->network_header. */ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) { @@ -518,6 +524,7 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) return -EINVAL; skb_reset_network_header(skb); + key->eth.type = skb->protocol; } else { eth = eth_hdr(skb); ether_addr_copy(key->eth.src, eth->h_source); @@ -531,15 +538,22 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) if (unlikely(parse_vlan(skb, key))) return -ENOMEM; - skb->protocol = parse_ethertype(skb); - if (unlikely(skb->protocol == htons(0))) + key->eth.type = parse_ethertype(skb); + if (unlikely(key->eth.type == htons(0))) return -ENOMEM; + if (skb->protocol == htons(ETH_P_TEB)) { + if (key->eth.vlan.tci & htons(VLAN_TAG_PRESENT) + && !skb_vlan_tag_present(skb)) + skb->protocol = key->eth.vlan.tpid; + else + skb->protocol = key->eth.type; + } + skb_reset_network_header(skb); __skb_push(skb, skb->data - skb_mac_header(skb)); } skb_reset_mac_len(skb); - key->eth.type = skb->protocol; /* Network layer. */ if (key->eth.type == htons(ETH_P_IP)) { -- 2.1.4
Re: [PATCH net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.
> On Nov 28, 2016, at 2:42 PM, Jiri Benc <jb...@redhat.com> wrote: > > On Mon, 28 Nov 2016 14:29:39 -0800, Jarno Rajahalme wrote: >> I’m not sure what you suggest here. Obviously the kernel ABI can not >> be changed as existing userspace code expects upcalled packets to be >> non-accelerated. Also, if userspace pushes vlan headers, the packet >> will actually have them. > > The user space API needs to be preserved, of course. I'm talking about > what happens internally in the kernel. > > See this patchset: https://www.spinics.net/lists/netdev/msg398827.html > I did not try to apply this series yet, but given the recent L3 changes maybe it needs a rebase? >> Would this incremental fix this: >> >> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c >> index 9be9fda..37f1bb9 100644 >> --- a/net/openvswitch/flow.c >> +++ b/net/openvswitch/flow.c >> @@ -354,6 +354,8 @@ static int parse_vlan(struct sk_buff *skb, struct >> sw_flow_key *key) res = parse_vlan_tag(skb, >eth.vlan); >> if (res <= 0) >> return res; >> +if (skb->protocol == htons(ETH_P_TEB)) >> +skb->protocol = key->eth.vlan.tpid; >> } >> >> /* Parse inner vlan tag. */ > > I'll look at this tomorrow. But it seems we're adding more and more > hacks instead of cleaning up the vlan handling. > Right, I just noticed that the incremental only handles the VLAN case. I’ll post a v2 later today with a proper fix that solves the immediate issue. IMO this should be fixed independently of the VLAN handling series for which I have no informed opinion yet. Jarno > Jiri
Re: [PATCH net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.
> On Nov 24, 2016, at 8:10 AM, Jiri Benc <jb...@redhat.com> wrote: > > On Tue, 22 Nov 2016 20:09:34 -0800, Jarno Rajahalme wrote: >> Do not set skb->protocol to be the ethertype of the L3 header, unless >> the packet only has the L3 header. For a non-hardware offloaded VLAN >> frame skb->protocol needs to be one of the VLAN ethertypes. >> >> Any VLAN offloading is undone on the OVS netlink interface. Due to >> this all VLAN packets sent to openvswitch module from userspace are >> non-offloaded. > > This is exactly why I wanted to always accelerate the vlan tag, the > same way it is done in other parts of the networking stack: to prevent > all those weird corner cases. > > Looks to me this is the only real way forward. > I’m not sure what you suggest here. Obviously the kernel ABI can not be changed as existing userspace code expects upcalled packets to be non-accelerated. Also, if userspace pushes vlan headers, the packet will actually have them. > This patch is wrong, it would leave skb->protocol as ETH_P_TEB for L2 > frames received via ARPHRD_NONE interface. > Would this incremental fix this: diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 9be9fda..37f1bb9 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -354,6 +354,8 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key) res = parse_vlan_tag(skb, >eth.vlan); if (res <= 0) return res; + if (skb->protocol == htons(ETH_P_TEB)) + skb->protocol = key->eth.vlan.tpid; } /* Parse inner vlan tag. */ Jarno
[PATCH net-next 1/2] openvswitch: Add a missing break statement.
Add a break statement to prevent fall-through from OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL. Without the break actions setting ethernet addresses fail to validate with log messages complaining about invalid tunnel attributes. Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/flow_netlink.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index d19044f..c87d359 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a, case OVS_KEY_ATTR_ETHERNET: if (mac_proto != MAC_PROTO_ETHERNET) return -EINVAL; + break; case OVS_KEY_ATTR_TUNNEL: if (masked) -- 2.1.4
[PATCH net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.
Do not set skb->protocol to be the ethertype of the L3 header, unless the packet only has the L3 header. For a non-hardware offloaded VLAN frame skb->protocol needs to be one of the VLAN ethertypes. Any VLAN offloading is undone on the OVS netlink interface. Due to this all VLAN packets sent to openvswitch module from userspace are non-offloaded. Incorrect skb->protocol value on a full-size non-offloaded VLAN skb causes packet drop due to failing MTU check, as the VLAN header should not be counted in when considering MTU in ovs_vport_send(). Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/datapath.c | 1 - net/openvswitch/flow.c | 20 +++- 2 files changed, 11 insertions(+), 10 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 2d4c4d3..9c62b63 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -606,7 +606,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) rcu_assign_pointer(flow->sf_acts, acts); packet->priority = flow->key.phy.priority; packet->mark = flow->key.phy.skb_mark; - packet->protocol = flow->key.eth.type; rcu_read_lock(); dp = get_dp_rcu(net, ovs_header->dp_ifindex); diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 08aa926..9be9fda 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -477,12 +477,17 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, } /** - * key_extract - extracts a flow key from an Ethernet frame. + * key_extract - extracts a flow key from a packet with or without an + * Ethernet header. * @skb: sk_buff that contains the frame, with skb->data pointing to the - * Ethernet header + * beginning of the packet. * @key: output flow key * - * The caller must ensure that skb->len >= ETH_HLEN. + * 'key->mac_proto' must be initialized to indicate the frame type. + * For an L3 frame 'key->mac_proto' must equal 'MAC_PROTO_NONE', and the + * caller must ensure that 'skb->protocol' is set to the ethertype of the L3 + * header. Otherwise the presence of an Ethernet header is assumed and + * the caller must ensure that skb->len >= ETH_HLEN. * * Returns 0 if successful, otherwise a negative errno value. * @@ -497,9 +502,6 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, * on output, then just past the IP header, if one is present and * of a correct length, otherwise the same as skb->network_header. * For other key->eth.type values it is left untouched. - * - *- skb->protocol: the type of the data starting at skb->network_header. - * Equals to key->eth.type. */ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) { @@ -518,6 +520,7 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) return -EINVAL; skb_reset_network_header(skb); + key->eth.type = skb->protocol; } else { eth = eth_hdr(skb); ether_addr_copy(key->eth.src, eth->h_source); @@ -531,15 +534,14 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) if (unlikely(parse_vlan(skb, key))) return -ENOMEM; - skb->protocol = parse_ethertype(skb); - if (unlikely(skb->protocol == htons(0))) + key->eth.type = parse_ethertype(skb); + if (unlikely(key->eth.type == htons(0))) return -ENOMEM; skb_reset_network_header(skb); __skb_push(skb, skb->data - skb_mac_header(skb)); } skb_reset_mac_len(skb); - key->eth.type = skb->protocol; /* Network layer. */ if (key->eth.type == htons(ETH_P_IP)) { -- 2.1.4
Re: Long delays creating a netns after deleting one (possibly RCU related)
> On Nov 14, 2016, at 3:09 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: > > On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote: >> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote: >> >>> synchronize_rcu_expidited is not enough if you have multiple network >>> devices in play. >>> >>> Looking at the code it comes down to this commit, and it appears there >>> is a promise add rcu grace period combining by Eric Dumazet. >>> >>> Eric since people are hitting noticable stalls because of the rcu grace >>> period taking a long time do you think you could look at this code path >>> a bit more? >>> >>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9 >>> Author: Eric Dumazet <eduma...@google.com> >>> Date: Wed Nov 18 06:31:03 2015 -0800 >> >> Absolutely, I will take a loop asap. > > The worst offender should be fixed by the following patch. > > busy poll needs to poll the physical device, not a virtual one... > > diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h > index > d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 > 100644 > --- a/include/net/gro_cells.h > +++ b/include/net/gro_cells.h > @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, > struct net_device *de > struct gro_cell *cell = per_cpu_ptr(gcells->cells, i); > > __skb_queue_head_init(>napi_skbs); > + > + set_bit(NAPI_STATE_NO_BUSY_POLL, >napi.state); > + > netif_napi_add(dev, >napi, gro_cell_poll, 64); > napi_enable(>napi); > } > > > > > This fixes the problem for me, so for whatever it’s worth: Tested-by: Jarno Rajahalme <ja...@ovn.org>
[PATCH net-next v2 3/5] virtio_net: Do not clear memory for struct virtio_net_hdr twice.
virtio_net_hdr_from_skb() clears the memory for the header, so there is no point for the callers to do the same. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- drivers/net/tun.c | 3 +-- include/linux/virtio_net.h | 2 +- net/packet/af_packet.c | 2 -- 3 files changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 3b8d8cc..64e694c 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1360,8 +1360,7 @@ static ssize_t tun_put_user(struct tun_struct *tun, } if (vnet_hdr_sz) { - struct virtio_net_hdr gso = { 0 }; /* no info leak */ - int ret; + struct virtio_net_hdr gso; if (iov_iter_count(iter) < vnet_hdr_sz) return -EINVAL; diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 74f1e33..6620400 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -58,7 +58,7 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, struct virtio_net_hdr *hdr, bool little_endian) { - memset(hdr, 0, sizeof(*hdr)); + memset(hdr, 0, sizeof(*hdr)); /* no info leak */ if (skb_is_gso(skb)) { struct skb_shared_info *sinfo = skb_shinfo(skb); diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index d2238b2..abe6c0b 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1970,8 +1970,6 @@ static unsigned int run_filter(struct sk_buff *skb, static int __packet_rcv_vnet(const struct sk_buff *skb, struct virtio_net_hdr *vnet_hdr) { - *vnet_hdr = (const struct virtio_net_hdr) { 0 }; - if (virtio_net_hdr_from_skb(skb, vnet_hdr, vio_le())) BUG(); -- 2.1.4
[PATCH net-next v2 4/5] af_packet: Use virtio_net_hdr_to_skb().
Use the common virtio_net_hdr_to_skb() instead of open coding it. Other call sites were changed by commit fd2a0437dc, but this one was missed, maybe because it is split in two parts of the source code. Interim comparisons of 'vnet_hdr->gso_type' still work as both the vnet_hdr and skb notion of gso_type is zero when there is no gso. Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/packet/af_packet.c | 51 +++--- 1 file changed, 3 insertions(+), 48 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index abe6c0b..1816b77 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2388,8 +2388,6 @@ static void tpacket_set_protocol(const struct net_device *dev, static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len) { - unsigned short gso_type = 0; - if ((vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && (__virtio16_to_cpu(vio_le(), vnet_hdr->csum_start) + __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset) + 2 > @@ -2401,29 +2399,6 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len) if (__virtio16_to_cpu(vio_le(), vnet_hdr->hdr_len) > len) return -EINVAL; - if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { - switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) { - case VIRTIO_NET_HDR_GSO_TCPV4: - gso_type = SKB_GSO_TCPV4; - break; - case VIRTIO_NET_HDR_GSO_TCPV6: - gso_type = SKB_GSO_TCPV6; - break; - case VIRTIO_NET_HDR_GSO_UDP: - gso_type = SKB_GSO_UDP; - break; - default: - return -EINVAL; - } - - if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) - gso_type |= SKB_GSO_TCP_ECN; - - if (vnet_hdr->gso_size == 0) - return -EINVAL; - } - - vnet_hdr->gso_type = gso_type; /* changes type, temporary storage */ return 0; } @@ -2443,27 +2418,6 @@ static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len, return __packet_snd_vnet_parse(vnet_hdr, *len); } -static int packet_snd_vnet_gso(struct sk_buff *skb, - struct virtio_net_hdr *vnet_hdr) -{ - if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { - u16 s = __virtio16_to_cpu(vio_le(), vnet_hdr->csum_start); - u16 o = __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset); - - if (!skb_partial_csum_set(skb, s, o)) - return -EINVAL; - } - - skb_shinfo(skb)->gso_size = - __virtio16_to_cpu(vio_le(), vnet_hdr->gso_size); - skb_shinfo(skb)->gso_type = vnet_hdr->gso_type; - - /* Header must be checked, and gso_segs computed. */ - skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; - skb_shinfo(skb)->gso_segs = 0; - return 0; -} - static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb, void *frame, struct net_device *dev, void *data, int tp_len, __be16 proto, unsigned char *addr, int hlen, int copylen, @@ -2723,7 +2677,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) } } - if (po->has_vnet_hdr && packet_snd_vnet_gso(skb, vnet_hdr)) { + if (po->has_vnet_hdr && virtio_net_hdr_to_skb(skb, vnet_hdr, + vio_le())) { tp_len = -EINVAL; goto tpacket_error; } @@ -2914,7 +2869,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) packet_pick_tx_queue(dev, skb); if (po->has_vnet_hdr) { - err = packet_snd_vnet_gso(skb, _hdr); + err = virtio_net_hdr_to_skb(skb, _hdr, vio_le()); if (err) goto out_free; len += sizeof(vnet_hdr); -- 2.1.4
[PATCH net-next v2 2/5] virtio_net.h: Fix comment.
Fix incorrent comment after the final #endif. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- include/linux/virtio_net.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 1c912f8..74f1e33 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, return 0; } -#endif /* _LINUX_VIRTIO_BYTEORDER */ +#endif /* _LINUX_VIRTIO_NET_H */ -- 2.1.4
[PATCH net-next v2 5/5] af_packet: Use virtio_net_hdr_from_skb() directly.
Remove static function __packet_rcv_vnet(), which only called virtio_net_hdr_from_skb() and BUG()ged out if an error code was returned. Instead, call virtio_net_hdr_from_skb() from the former call sites of __packet_rcv_vnet() and actually use the error handling code that is already there. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/packet/af_packet.c | 16 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 1816b77..fab9bbf 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1967,15 +1967,6 @@ static unsigned int run_filter(struct sk_buff *skb, return res; } -static int __packet_rcv_vnet(const struct sk_buff *skb, -struct virtio_net_hdr *vnet_hdr) -{ - if (virtio_net_hdr_from_skb(skb, vnet_hdr, vio_le())) - BUG(); - - return 0; -} - static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb, size_t *len) { @@ -1985,7 +1976,7 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb, return -EINVAL; *len -= sizeof(vnet_hdr); - if (__packet_rcv_vnet(skb, _hdr)) + if (virtio_net_hdr_from_skb(skb, _hdr, vio_le())) return -EINVAL; return memcpy_to_msg(msg, (void *)_hdr, sizeof(vnet_hdr)); @@ -2244,8 +2235,9 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, spin_unlock(>sk_receive_queue.lock); if (po->has_vnet_hdr) { - if (__packet_rcv_vnet(skb, h.raw + macoff - - sizeof(struct virtio_net_hdr))) { + if (virtio_net_hdr_from_skb(skb, h.raw + macoff - + sizeof(struct virtio_net_hdr), + vio_le())) { spin_lock(>sk_receive_queue.lock); goto drop_n_account; } -- 2.1.4
[PATCH net-next v2 1/5] virtio_net: Simplify call sites for virtio_net_hdr_{from,to}_skb().
No point storing the return value of virtio_net_hdr_to_skb() or virtio_net_hdr_from_skb() to a variable when the value is used only once as a boolean in an immediately following if statement. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- drivers/net/macvtap.c | 5 ++--- drivers/net/tun.c | 8 +++- 2 files changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 070e329..5da9861 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, if (iov_iter_count(iter) < vnet_hdr_len) return -EINVAL; - ret = virtio_net_hdr_from_skb(skb, _hdr, - macvtap_is_little_endian(q)); - if (ret) + if (virtio_net_hdr_from_skb(skb, _hdr, + macvtap_is_little_endian(q))) BUG(); if (copy_to_iter(_hdr, sizeof(vnet_hdr), iter) != diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 1588469..3b8d8cc 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1252,8 +1252,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, return -EFAULT; } - err = virtio_net_hdr_to_skb(skb, , tun_is_little_endian(tun)); - if (err) { + if (virtio_net_hdr_to_skb(skb, , tun_is_little_endian(tun))) { this_cpu_inc(tun->pcpu_stats->rx_frame_errors); kfree_skb(skb); return -EINVAL; @@ -1367,9 +1366,8 @@ static ssize_t tun_put_user(struct tun_struct *tun, if (iov_iter_count(iter) < vnet_hdr_sz) return -EINVAL; - ret = virtio_net_hdr_from_skb(skb, , - tun_is_little_endian(tun)); - if (ret) { + if (virtio_net_hdr_from_skb(skb, , + tun_is_little_endian(tun))) { struct skb_shared_info *sinfo = skb_shinfo(skb); pr_err("unexpected GSO type: " "0x%x, gso_size %d, hdr_len %d\n", -- 2.1.4
Re: [net-next] af_packet: Use virtio_net_hdr_to_skb().
Sorry for my transgressions and wasting your time. I’ll send a v2 in a moment. Jarno > On Nov 18, 2016, at 8:35 AM, David Miller <da...@davemloft.net> wrote: > > From: Jarno Rajahalme <ja...@ovn.org> > Date: Wed, 16 Nov 2016 18:06:42 -0800 > >> Use the common virtio_net_hdr_to_skb() instead of open coding it. >> Other call sites were changed by commit fd2a0437dc, but this one was >> missed, maybe because it is split in two parts of the source code. >> >> Also fix other call sites to be more uniform. >> >> Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb") >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > This patch is doing many more things that just this. > > Do not mix unrelated changes together: > >> @@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, >> if (iov_iter_count(iter) < vnet_hdr_len) >> return -EINVAL; >> >> -ret = virtio_net_hdr_from_skb(skb, _hdr, >> - macvtap_is_little_endian(q)); >> -if (ret) >> +if (virtio_net_hdr_from_skb(skb, _hdr, >> +macvtap_is_little_endian(q))) >> BUG(); >> >> if (copy_to_iter(_hdr, sizeof(vnet_hdr), iter) != > > This has nothing to do with modifying code to use vrtio_net_hdr_to_skb(), it > doesn't belong in this patch. > >> @@ -1361,15 +1360,12 @@ static ssize_t tun_put_user(struct tun_struct *tun, >> } >> >> if (vnet_hdr_sz) { >> -struct virtio_net_hdr gso = { 0 }; /* no info leak */ >> -int ret; >> - >> +struct virtio_net_hdr gso; > > This is _extremely_ opaque. The initializer is trying to prevent kernel > memory > info leaks onto the network or into user space. > > Maybe this transformation is valid but: > > 1) YOU DON'T EVEN MENTION IT IN YOUR COMMIT MESSAGE. > > 2) It's unrelated to this specific change, therefore it belongs in > a separate change. > > 3) You don't explain that it is a valid transformation, not why. > > It is extremely disappointing to catch unrelated, potentially far > reaching things embedded in a patch when I review it. > > Please do not ever do this. > >> @@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct >> sk_buff *skb, >> return 0; >> } >> >> -#endif /* _LINUX_VIRTIO_BYTEORDER */ >> +#endif /* _LINUX_VIRTIO_NET_H */ > > Another unrelated change. > >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >> index 11db0d6..09abb88 100644 >> --- a/net/packet/af_packet.c >> +++ b/net/packet/af_packet.c >> @@ -1971,8 +1971,6 @@ static unsigned int run_filter(struct sk_buff *skb, >> static int __packet_rcv_vnet(const struct sk_buff *skb, >> struct virtio_net_hdr *vnet_hdr) >> { >> -*vnet_hdr = (const struct virtio_net_hdr) { 0 }; >> - > > There is no way this belongs in this patch, and again you do not explain > why removing this initializer is valid.
Re: Long delays creating a netns after deleting one (possibly RCU related)
> On Nov 14, 2016, at 3:09 PM, Eric Dumazetwrote: > > On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote: >> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote: >> >>> synchronize_rcu_expidited is not enough if you have multiple network >>> devices in play. >>> >>> Looking at the code it comes down to this commit, and it appears there >>> is a promise add rcu grace period combining by Eric Dumazet. >>> >>> Eric since people are hitting noticable stalls because of the rcu grace >>> period taking a long time do you think you could look at this code path >>> a bit more? >>> >>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9 >>> Author: Eric Dumazet >>> Date: Wed Nov 18 06:31:03 2015 -0800 >> >> Absolutely, I will take a loop asap. > > The worst offender should be fixed by the following patch. > > busy poll needs to poll the physical device, not a virtual one... > > diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h > index > d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 > 100644 > --- a/include/net/gro_cells.h > +++ b/include/net/gro_cells.h > @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, > struct net_device *de > struct gro_cell *cell = per_cpu_ptr(gcells->cells, i); > > __skb_queue_head_init(>napi_skbs); > + > + set_bit(NAPI_STATE_NO_BUSY_POLL, >napi.state); > + > netif_napi_add(dev, >napi, gro_cell_poll, 64); > napi_enable(>napi); > } > This solved a ~20 second slowdown between OVS datapath unit tests for me. Jarno
Re: Virtio_net support vxlan encapsulation package TSO offload discuss
I worked on the same issue a few months back. I rebased my proof-of-concept code to the current net-next and posted an RFC patch a moment ago. I have zero experience on QEMU feature negotiation or extending the virtio_net spec. Since the virtio_net handling code is now all done using shared code, this should work for macvtap as well, not sure if macvtap needs some control plane changes. I posted a separate patch to make af_packet also use the shared infra for virtio_net handling yesterday. My RFC patch assumes that af_packet need not be touched, i.e., assumes the af_packet patch is applied, even though the patches apply to net-next in either order. Jarno > On Nov 16, 2016, at 11:27 PM, Jason Wangwrote: > > > > On 2016年11月17日 09:31, Zhangming (James, Euler) wrote: >> On 2016年11月15日 11:28, Jason Wang wrote: >>> On 2016年11月10日 14:19, Zhangming (James, Euler) wrote: On 2016年11月09日 15:14, Jason Wang wrote: > On 2016年11月08日 19:58, Zhangming (James, Euler) wrote: >> On 2016年11月08日 19:17, Jason Wang wrote: >> >>> On 2016年11月08日 19:13, Jason Wang wrote: Cc Michael On 2016年11月08日 16:34, Zhangming (James, Euler) wrote: > In container scenario, OVS is installed in the Virtual machine, > and all the containers connected to the OVS will communicated > through VXLAN encapsulation. > > By now, virtio_net does not support TSO offload for VXLAN > encapsulated TSO package. In this condition, the performance is > not good, sender is bottleneck > > I googled this scenario, but I didn’t find any information. Will > virtio_net support VXLAN encapsulation package TSO offload later? > Yes and for both sender and receiver. > My idea is virtio_net open encapsulated TSO offload, and > transport encapsulation info to TUN, TUN will parse the info and > build skb with encapsulation info. > > OVS or kernel on the host should be modified to support this. > Using this method, the TCP performance aremore than 2x as before. > > Any advice and suggestions for this idea or new idea will be > greatly appreciated! > > Best regards, > > James zhang > Sounds very good. And we may also need features bits (VIRTIO_NET_F_GUEST|HOST_GSO_X) for this. This is in fact one of items in networking todo list. (See http://www.linux-kvm.org/page/NetworkingTodo). While at it, we'd better support not only VXLAN but also other tunnels. >>> Cc Vlad who is working on extending virtio-net headers. >>> We can start with the spec work, or if you've already had some bits you can post them as RFC for early review. Thanks >> Below is my demo code >> Virtio_net.c >> static int virtnet_probe(struct virtio_device *vdev), add belows codes: >> if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) || >> // avoid gso segment, it should be negotiation >> later, because in the demo I reuse num_buffers. >> virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) { >> dev->hw_enc_features |= NETIF_F_TSO; >> dev->hw_enc_features |= NETIF_F_ALL_CSUM; >> dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL; >> dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM; >> dev->hw_enc_features |= >> NETIF_F_GSO_TUNNEL_REMCSUM; >> >> dev->features |= NETIF_F_GSO_UDP_TUNNEL; >> dev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM; >> dev->features |= NETIF_F_GSO_TUNNEL_REMCSUM; >> } >> >> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb), add >> below to pieces of codes >> >> if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) >> hdr->hdr.gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL; >> if (skb_shinfo(skb)->gso_type & >> SKB_GSO_UDP_TUNNEL_CSUM) >> hdr->hdr.gso_type |= >> VIRTIO_NET_HDR_GSO_TUNNEL_CSUM; >> if (skb_shinfo(skb)->gso_type & SKB_GSO_TUNNEL_REMCSUM) >> hdr->hdr.gso_type |= >> VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM; >> >> if (skb->encapsulation && skb_is_gso(skb)) { >> inner_mac_len = skb_inner_network_header(skb) - >> skb_inner_mac_header(skb); >> tnl_len = skb_inner_mac_header(skb) - >> skb_mac_header(skb); >> if ( !(inner_mac_len >> DATA_LEN_SHIFT) && !(tnl_len >> >> DATA_LEN_SHIFT) ) { >>
[RFC PATCH net-next] virtio_net: Support UDP Tunnel offloads.
This patch is a proof-of-concept I did a few months ago for UDP tunnel offload support in virtio_net interface, and rebased on to the current net-next. Real implementation needs to extend the virtio_net header rather than piggy-backing on existing fields. Inner MAC length (or inner network offset) also needs to be passed as a new field. Control plane (QEMU) also needs to be updated. All testing was done using Geneve, but this should work for all UDP tunnels the same. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- drivers/net/tun.c | 7 - drivers/net/virtio_net.c| 16 +++--- include/linux/skbuff.h | 5 include/linux/virtio_net.h | 66 ++--- include/uapi/linux/virtio_net.h | 7 + 5 files changed, 78 insertions(+), 23 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 1588469..36f3219 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -198,7 +198,9 @@ struct tun_struct { struct net_device *dev; netdev_features_t set_features; #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \ - NETIF_F_TSO6|NETIF_F_UFO) + NETIF_F_TSO6|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL| \ + NETIF_F_GSO_UDP_TUNNEL_CSUM| \ + NETIF_F_GSO_TUNNEL_REMCSUM) int align; int vnet_hdr_sz; @@ -1877,6 +1879,9 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) if (arg & TUN_F_UFO) { features |= NETIF_F_UFO; +#if 1 + features |= NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM; +#endif arg &= ~TUN_F_UFO; } } diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ca5239a..eb8d887 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1789,7 +1789,10 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) { dev->hw_features |= NETIF_F_TSO | NETIF_F_UFO - | NETIF_F_TSO_ECN | NETIF_F_TSO6; + | NETIF_F_TSO_ECN | NETIF_F_TSO6 + | NETIF_F_GSO_UDP_TUNNEL + | NETIF_F_GSO_UDP_TUNNEL_CSUM + | NETIF_F_GSO_TUNNEL_REMCSUM; } /* Individual feature bits: what can host handle? */ if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_TSO4)) @@ -1798,13 +1801,18 @@ static int virtnet_probe(struct virtio_device *vdev) dev->hw_features |= NETIF_F_TSO6; if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_ECN)) dev->hw_features |= NETIF_F_TSO_ECN; - if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO)) + if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO)) { dev->hw_features |= NETIF_F_UFO; - +#if 1 + dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL; + dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM; + dev->hw_features |= NETIF_F_GSO_TUNNEL_REMCSUM; +#endif + } dev->features |= NETIF_F_GSO_ROBUST; if (gso) - dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO); + dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM); /* (!csum && gso) case will be fixed by register_netdev() */ } if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM)) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a4aeeca..992ad30 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2115,6 +2115,11 @@ static inline unsigned char *skb_inner_mac_header(const struct sk_buff *skb) return skb->head + skb->inner_mac_header; } +static inline int skb_inner_mac_offset(const struct sk_buff *skb) +{ + return skb_inner_mac_header(skb) - skb->data; +} + static inline void skb_reset_inner_mac_header(struct sk_buff *skb) { skb->inner_mac_header = skb->data - skb->head; diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 1c912f8..17384d1 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -8,10 +8,19 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, const struct virtio_net_hdr *hdr, bool little_endian) { - unsigned short gso_type = 0; + u16
[PATCH net-next] af_packet: Use virtio_net_hdr_to_skb().
Use the common virtio_net_hdr_to_skb() instead of open coding it. Other call sites were changed by commit fd2a0437dc, but this one was missed, maybe because it is split in two parts of the source code. Also fix other call sites to be more uniform. Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- drivers/net/macvtap.c | 5 ++--- drivers/net/tun.c | 12 include/linux/virtio_net.h | 2 +- net/packet/af_packet.c | 44 +--- 4 files changed, 8 insertions(+), 55 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 070e329..5da9861 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, if (iov_iter_count(iter) < vnet_hdr_len) return -EINVAL; - ret = virtio_net_hdr_from_skb(skb, _hdr, - macvtap_is_little_endian(q)); - if (ret) + if (virtio_net_hdr_from_skb(skb, _hdr, + macvtap_is_little_endian(q))) BUG(); if (copy_to_iter(_hdr, sizeof(vnet_hdr), iter) != diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 9328568..864aae3 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1252,8 +1252,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, return -EFAULT; } - err = virtio_net_hdr_to_skb(skb, , tun_is_little_endian(tun)); - if (err) { + if (virtio_net_hdr_to_skb(skb, , tun_is_little_endian(tun))) { this_cpu_inc(tun->pcpu_stats->rx_frame_errors); kfree_skb(skb); return -EINVAL; @@ -1361,15 +1360,12 @@ static ssize_t tun_put_user(struct tun_struct *tun, } if (vnet_hdr_sz) { - struct virtio_net_hdr gso = { 0 }; /* no info leak */ - int ret; - + struct virtio_net_hdr gso; if (iov_iter_count(iter) < vnet_hdr_sz) return -EINVAL; - ret = virtio_net_hdr_from_skb(skb, , - tun_is_little_endian(tun)); - if (ret) { + if (virtio_net_hdr_from_skb(skb, , + tun_is_little_endian(tun))) { struct skb_shared_info *sinfo = skb_shinfo(skb); pr_err("unexpected GSO type: " "0x%x, gso_size %d, hdr_len %d\n", diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 1c912f8..74f1e33 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, return 0; } -#endif /* _LINUX_VIRTIO_BYTEORDER */ +#endif /* _LINUX_VIRTIO_NET_H */ diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 11db0d6..09abb88 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1971,8 +1971,6 @@ static unsigned int run_filter(struct sk_buff *skb, static int __packet_rcv_vnet(const struct sk_buff *skb, struct virtio_net_hdr *vnet_hdr) { - *vnet_hdr = (const struct virtio_net_hdr) { 0 }; - if (virtio_net_hdr_from_skb(skb, vnet_hdr, vio_le())) BUG(); @@ -2391,8 +2389,6 @@ static void tpacket_set_protocol(const struct net_device *dev, static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len) { - unsigned short gso_type = 0; - if ((vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && (__virtio16_to_cpu(vio_le(), vnet_hdr->csum_start) + __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset) + 2 > @@ -2404,29 +2400,6 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len) if (__virtio16_to_cpu(vio_le(), vnet_hdr->hdr_len) > len) return -EINVAL; - if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { - switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) { - case VIRTIO_NET_HDR_GSO_TCPV4: - gso_type = SKB_GSO_TCPV4; - break; - case VIRTIO_NET_HDR_GSO_TCPV6: - gso_type = SKB_GSO_TCPV6; - break; - case VIRTIO_NET_HDR_GSO_UDP: - gso_type = SKB_GSO_UDP; - break; - default: - return -EINVAL; - } - - if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) - gso_type |= SKB_GSO
[net-next] af_packet: Use virtio_net_hdr_to_skb().
Use the common virtio_net_hdr_to_skb() instead of open coding it. Other call sites were changed by commit fd2a0437dc, but this one was missed, maybe because it is split in two parts of the source code. Also fix other call sites to be more uniform. Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb") Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- drivers/net/macvtap.c | 5 ++--- drivers/net/tun.c | 12 include/linux/virtio_net.h | 2 +- net/packet/af_packet.c | 44 +--- 4 files changed, 8 insertions(+), 55 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 070e329..5da9861 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, if (iov_iter_count(iter) < vnet_hdr_len) return -EINVAL; - ret = virtio_net_hdr_from_skb(skb, _hdr, - macvtap_is_little_endian(q)); - if (ret) + if (virtio_net_hdr_from_skb(skb, _hdr, + macvtap_is_little_endian(q))) BUG(); if (copy_to_iter(_hdr, sizeof(vnet_hdr), iter) != diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 9328568..864aae3 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1252,8 +1252,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, return -EFAULT; } - err = virtio_net_hdr_to_skb(skb, , tun_is_little_endian(tun)); - if (err) { + if (virtio_net_hdr_to_skb(skb, , tun_is_little_endian(tun))) { this_cpu_inc(tun->pcpu_stats->rx_frame_errors); kfree_skb(skb); return -EINVAL; @@ -1361,15 +1360,12 @@ static ssize_t tun_put_user(struct tun_struct *tun, } if (vnet_hdr_sz) { - struct virtio_net_hdr gso = { 0 }; /* no info leak */ - int ret; - + struct virtio_net_hdr gso; if (iov_iter_count(iter) < vnet_hdr_sz) return -EINVAL; - ret = virtio_net_hdr_from_skb(skb, , - tun_is_little_endian(tun)); - if (ret) { + if (virtio_net_hdr_from_skb(skb, , + tun_is_little_endian(tun))) { struct skb_shared_info *sinfo = skb_shinfo(skb); pr_err("unexpected GSO type: " "0x%x, gso_size %d, hdr_len %d\n", diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 1c912f8..74f1e33 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, return 0; } -#endif /* _LINUX_VIRTIO_BYTEORDER */ +#endif /* _LINUX_VIRTIO_NET_H */ diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 11db0d6..09abb88 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1971,8 +1971,6 @@ static unsigned int run_filter(struct sk_buff *skb, static int __packet_rcv_vnet(const struct sk_buff *skb, struct virtio_net_hdr *vnet_hdr) { - *vnet_hdr = (const struct virtio_net_hdr) { 0 }; - if (virtio_net_hdr_from_skb(skb, vnet_hdr, vio_le())) BUG(); @@ -2391,8 +2389,6 @@ static void tpacket_set_protocol(const struct net_device *dev, static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len) { - unsigned short gso_type = 0; - if ((vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && (__virtio16_to_cpu(vio_le(), vnet_hdr->csum_start) + __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset) + 2 > @@ -2404,29 +2400,6 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len) if (__virtio16_to_cpu(vio_le(), vnet_hdr->hdr_len) > len) return -EINVAL; - if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { - switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) { - case VIRTIO_NET_HDR_GSO_TCPV4: - gso_type = SKB_GSO_TCPV4; - break; - case VIRTIO_NET_HDR_GSO_TCPV6: - gso_type = SKB_GSO_TCPV6; - break; - case VIRTIO_NET_HDR_GSO_UDP: - gso_type = SKB_GSO_UDP; - break; - default: - return -EINVAL; - } - - if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) - gso_type |= SKB_GSO
[PATCH net] openvswitch: Remove incorrect WARN_ONCE().
ovs_ct_find_existing() issues a warning if an existing conntrack entry classified as IP_CT_NEW is found, with the premise that this should not happen. However, a newly confirmed, non-expected conntrack entry remains IP_CT_NEW as long as no reply direction traffic is seen. This has resulted into somewhat confusing kernel log messages. This patch removes this check and warning. Fixes: 289f2253 ("openvswitch: Find existing conntrack entry after upcall.") Suggested-by: Joe Stringer <j...@ovn.org> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index c644c78..e054a74 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -433,7 +433,6 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, struct nf_conntrack_l4proto *l4proto; struct nf_conntrack_tuple tuple; struct nf_conntrack_tuple_hash *h; - enum ip_conntrack_info ctinfo; struct nf_conn *ct; unsigned int dataoff; u8 protonum; @@ -458,13 +457,8 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, ct = nf_ct_tuplehash_to_ctrack(h); - ctinfo = ovs_ct_get_info(h); - if (ctinfo == IP_CT_NEW) { - /* This should not happen. */ - WARN_ONCE(1, "ovs_ct_find_existing: new packet for %p\n", ct); - } skb->nfct = >ct_general; - skb->nfctinfo = ctinfo; + skb->nfctinfo = ovs_ct_get_info(h); return ct; } -- 2.1.4
Re: [ovs-dev] [PATCH net-next] openvswitch: Only set mark and labels when commiting a connection.
Thanks for the review! > On Jun 21, 2016, at 1:57 PM, Joe Stringer <j...@ovn.org> wrote: > > On 20 June 2016 at 17:19, Jarno Rajahalme <ja...@ovn.org> wrote: >> Only allow setting conntrack mark or labels when the commit flag is >> specified. This makes sure we can not set them before the connection >> has been persisted, as in that case the mark and labels would be lost >> in an event of an userspace upcall. >> >> OVS userspace already requires the commit flag to accept setting >> ct_mark and/or ct_labels. Validate for this on the kernel API. >> >> Finally, set conntrack mark and labels right before committing so that >> the initial conntrack NEW event has the mark and labels. >> >> Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > > The structure of this commit message suggests there are multiple > changes trying to be addressed in one patch. I suggest splitting them > out. > Done for v2 I just sent for net. > In terms of applying the mark and labels before committing the > connection, that's actually the behaviour I would expect if you were > to execute ct(mark=foo,commit). The NEW event should include these > pieces, and should have all along. Right, the v2 patch 1/2 does this. > >> @@ -1145,6 +1155,20 @@ static int parse_ct(const struct nlattr *attr, struct >> ovs_conntrack_info *info, >>} >>} >> >> +#ifdef CONFIG_NF_CONNTRACK_MARK >> + if (!info->commit && info->mark.mask) { >> + OVS_NLERR(log, >> + "Setting conntrack mark requires 'commit' flag."); >> + return -EINVAL; >> + } >> +#endif >> +#ifdef CONFIG_NF_CONNTRACK_LABELS >> + if (!info->commit && labels_nonzero(>labels.mask)) { >> + OVS_NLERR(log, >> + "Setting conntrack labels requires 'commit' >> flag."); >> + return -EINVAL; >> + } >> +#endif > > I'm of mixed minds about this, but I lean towards agreeing with it. On > one hand, it's applying more restrictions on an otherwise fairly loose > interface and if anyone is relying on this behaviour then it would be > surprising to have this restriction introduced. On the other hand, it > doesn't make a lot of sense to set a label/mark but not to commit the > connection. As you say, the behaviour isn't exactly consistent in that > case today anyway: If there was a flow with > actions=ct(mark=foo),recirc() followed by a userspace upcall, then the > mark would be reflected in the flow key but not saved to any persisted > connection. A subsequent ct(commit) after upcall wouldn't persist it, > either. However if there were two flows already in the datapath to do > this, then it /would/ be persisted. Restricting the mark/labels > modification to only if you have the "commit" flag would address that > consistency issue. The OVS userspace enforcing this constraint also > hints that this was an unintentional omission from kernel validation. I separated this out to the v2 patch 2/2. Jarno
[PATCH net v2 1/2] openvswitch: Set mark and labels before confirming.
Set conntrack mark and labels right before committing so that the initial conntrack NEW event has the mark and labels. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- v2: Separate Kernel API change to an RFC patch (2/2). net/openvswitch/conntrack.c | 33 ++--- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 3d5feed..23fd4fb 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -824,23 +824,6 @@ static int ovs_ct_lookup(struct net *net, struct sw_flow_key *key, return 0; } -/* Lookup connection and confirm if unconfirmed. */ -static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, -const struct ovs_conntrack_info *info, -struct sk_buff *skb) -{ - int err; - - err = __ovs_ct_lookup(net, key, info, skb); - if (err) - return err; - /* This is a no-op if the connection has already been confirmed. */ - if (nf_conntrack_confirm(skb) != NF_ACCEPT) - return -EINVAL; - - return 0; -} - static bool labels_nonzero(const struct ovs_key_ct_labels *labels) { size_t i; @@ -873,21 +856,33 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb, } if (info->commit) - err = ovs_ct_commit(net, key, info, skb); + err = __ovs_ct_lookup(net, key, info, skb); else err = ovs_ct_lookup(net, key, info, skb); if (err) goto err; + /* Apply changes before confirming the connection so that the initial +* conntrack NEW netlink event carries the values given in the CT +* action. +*/ if (info->mark.mask) { err = ovs_ct_set_mark(skb, key, info->mark.value, info->mark.mask); if (err) goto err; } - if (labels_nonzero(>labels.mask)) + if (labels_nonzero(>labels.mask)) { err = ovs_ct_set_labels(skb, key, >labels.value, >labels.mask); + if (err) + goto err; + } + /* This will take care of sending queued events even if the connection +* is already confirmed. +*/ + if (info->commit && nf_conntrack_confirm(skb) != NF_ACCEPT) + err = -EINVAL; err: skb_push(skb, nh_ofs); if (err) -- 2.1.4
[RFC PATCH net v2 2/2] openvswitch: Only set mark and labels with a commit flag.
Only set conntrack mark or labels when the commit flag is specified. This makes sure we can not set them before the connection has been persisted, as in that case the mark and labels would be lost in an event of an userspace upcall. OVS userspace already requires the commit flag to accept setting ct_mark and/or ct_labels. Validate for this in the kernel API. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 76 ++--- 1 file changed, 51 insertions(+), 25 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 23fd4fb..52f3b9b 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -835,6 +835,42 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels) return false; } +/* Lookup connection and confirm if unconfirmed. */ +static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, +const struct ovs_conntrack_info *info, +struct sk_buff *skb) +{ + int err; + + err = __ovs_ct_lookup(net, key, info, skb); + if (err) + return err; + + /* Apply changes before confirming the connection so that the initial +* conntrack NEW netlink event carries the values given in the CT +* action. +*/ + if (info->mark.mask) { + err = ovs_ct_set_mark(skb, key, info->mark.value, + info->mark.mask); + if (err) + return err; + } + if (labels_nonzero(>labels.mask)) { + err = ovs_ct_set_labels(skb, key, >labels.value, + >labels.mask); + if (err) + return err; + } + /* This will take care of sending queued events even if the connection +* is already confirmed. +*/ + if (nf_conntrack_confirm(skb) != NF_ACCEPT) + return -EINVAL; + + return 0; +} + /* Returns 0 on success, -EINPROGRESS if 'skb' is stolen, or other nonzero * value if 'skb' is freed. */ @@ -856,34 +892,10 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb, } if (info->commit) - err = __ovs_ct_lookup(net, key, info, skb); + err = ovs_ct_commit(net, key, info, skb); else err = ovs_ct_lookup(net, key, info, skb); - if (err) - goto err; - /* Apply changes before confirming the connection so that the initial -* conntrack NEW netlink event carries the values given in the CT -* action. -*/ - if (info->mark.mask) { - err = ovs_ct_set_mark(skb, key, info->mark.value, - info->mark.mask); - if (err) - goto err; - } - if (labels_nonzero(>labels.mask)) { - err = ovs_ct_set_labels(skb, key, >labels.value, - >labels.mask); - if (err) - goto err; - } - /* This will take care of sending queued events even if the connection -* is already confirmed. -*/ - if (info->commit && nf_conntrack_confirm(skb) != NF_ACCEPT) - err = -EINVAL; -err: skb_push(skb, nh_ofs); if (err) kfree_skb(skb); @@ -1140,6 +1152,20 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, } } +#ifdef CONFIG_NF_CONNTRACK_MARK + if (!info->commit && info->mark.mask) { + OVS_NLERR(log, + "Setting conntrack mark requires 'commit' flag."); + return -EINVAL; + } +#endif +#ifdef CONFIG_NF_CONNTRACK_LABELS + if (!info->commit && labels_nonzero(>labels.mask)) { + OVS_NLERR(log, + "Setting conntrack labels requires 'commit' flag."); + return -EINVAL; + } +#endif if (rem > 0) { OVS_NLERR(log, "Conntrack attr has %d unknown bytes", rem); return -EINVAL; -- 2.1.4
[PATCH net-next] openvswitch: Only set mark and labels when commiting a connection.
Only allow setting conntrack mark or labels when the commit flag is specified. This makes sure we can not set them before the connection has been persisted, as in that case the mark and labels would be lost in an event of an userspace upcall. OVS userspace already requires the commit flag to accept setting ct_mark and/or ct_labels. Validate for this on the kernel API. Finally, set conntrack mark and labels right before committing so that the initial conntrack NEW event has the mark and labels. Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- net/openvswitch/conntrack.c | 72 ++--- 1 file changed, 48 insertions(+), 24 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 3d5feed..f1612f5 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -824,6 +824,17 @@ static int ovs_ct_lookup(struct net *net, struct sw_flow_key *key, return 0; } +static bool labels_nonzero(const struct ovs_key_ct_labels *labels) +{ + size_t i; + + for (i = 0; i < sizeof(*labels); i++) + if (labels->ct_labels[i]) + return true; + + return false; +} + /* Lookup connection and confirm if unconfirmed. */ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, const struct ovs_conntrack_info *info, @@ -834,24 +845,35 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, err = __ovs_ct_lookup(net, key, info, skb); if (err) return err; - /* This is a no-op if the connection has already been confirmed. */ + + /* As any changes to an unconfirmed connection may be lost due +* to an upcall, we require the presence of the 'commit' flag +* for setting mask and/or labels. Perform the changes before +* confirming the connection so that the initial mark and label +* values are present in the initial CT NEW netlink event. +*/ + if (info->mark.mask) { + err = ovs_ct_set_mark(skb, key, info->mark.value, + info->mark.mask); + if (err) + return err; + } + if (labels_nonzero(>labels.mask)) { + err = ovs_ct_set_labels(skb, key, >labels.value, + >labels.mask); + if (err) + return err; + } + + /* Will take care of sending queued events even if the connection is +* already confirmed. +*/ if (nf_conntrack_confirm(skb) != NF_ACCEPT) return -EINVAL; return 0; } -static bool labels_nonzero(const struct ovs_key_ct_labels *labels) -{ - size_t i; - - for (i = 0; i < sizeof(*labels); i++) - if (labels->ct_labels[i]) - return true; - - return false; -} - /* Returns 0 on success, -EINPROGRESS if 'skb' is stolen, or other nonzero * value if 'skb' is freed. */ @@ -876,19 +898,7 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb, err = ovs_ct_commit(net, key, info, skb); else err = ovs_ct_lookup(net, key, info, skb); - if (err) - goto err; - if (info->mark.mask) { - err = ovs_ct_set_mark(skb, key, info->mark.value, - info->mark.mask); - if (err) - goto err; - } - if (labels_nonzero(>labels.mask)) - err = ovs_ct_set_labels(skb, key, >labels.value, - >labels.mask); -err: skb_push(skb, nh_ofs); if (err) kfree_skb(skb); @@ -1145,6 +1155,20 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, } } +#ifdef CONFIG_NF_CONNTRACK_MARK + if (!info->commit && info->mark.mask) { + OVS_NLERR(log, + "Setting conntrack mark requires 'commit' flag."); + return -EINVAL; + } +#endif +#ifdef CONFIG_NF_CONNTRACK_LABELS + if (!info->commit && labels_nonzero(>labels.mask)) { + OVS_NLERR(log, + "Setting conntrack labels requires 'commit' flag."); + return -EINVAL; + } +#endif if (rem > 0) { OVS_NLERR(log, "Conntrack attr has %d unknown bytes", rem); return -EINVAL; -- 2.1.4
Re: [PATCH net-next] openvswitch: Only set mark and labels when commiting a connection.
The title should have been: openvswitch: Only set mark and labels with a commit flag. This reflects the fact that modifying the mark and/or labels of an existing connection is allowed. The commit flag is still required to do that, though. Jarno > On Jun 20, 2016, at 5:19 PM, Jarno Rajahalme <ja...@ovn.org> wrote: > > Only allow setting conntrack mark or labels when the commit flag is > specified. This makes sure we can not set them before the connection > has been persisted, as in that case the mark and labels would be lost > in an event of an userspace upcall. > > OVS userspace already requires the commit flag to accept setting > ct_mark and/or ct_labels. Validate for this on the kernel API. > > Finally, set conntrack mark and labels right before committing so that > the initial conntrack NEW event has the mark and labels. > > Signed-off-by: Jarno Rajahalme <ja...@ovn.org> > --- > net/openvswitch/conntrack.c | 72 ++--- > 1 file changed, 48 insertions(+), 24 deletions(-) > > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c > index 3d5feed..f1612f5 100644 > --- a/net/openvswitch/conntrack.c > +++ b/net/openvswitch/conntrack.c > @@ -824,6 +824,17 @@ static int ovs_ct_lookup(struct net *net, struct > sw_flow_key *key, > return 0; > } > > +static bool labels_nonzero(const struct ovs_key_ct_labels *labels) > +{ > + size_t i; > + > + for (i = 0; i < sizeof(*labels); i++) > + if (labels->ct_labels[i]) > + return true; > + > + return false; > +} > + > /* Lookup connection and confirm if unconfirmed. */ > static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, >const struct ovs_conntrack_info *info, > @@ -834,24 +845,35 @@ static int ovs_ct_commit(struct net *net, struct > sw_flow_key *key, > err = __ovs_ct_lookup(net, key, info, skb); > if (err) > return err; > - /* This is a no-op if the connection has already been confirmed. */ > + > + /* As any changes to an unconfirmed connection may be lost due > + * to an upcall, we require the presence of the 'commit' flag > + * for setting mask and/or labels. Perform the changes before > + * confirming the connection so that the initial mark and label > + * values are present in the initial CT NEW netlink event. > + */ > + if (info->mark.mask) { > + err = ovs_ct_set_mark(skb, key, info->mark.value, > + info->mark.mask); > + if (err) > + return err; > + } > + if (labels_nonzero(>labels.mask)) { > + err = ovs_ct_set_labels(skb, key, >labels.value, > + >labels.mask); > + if (err) > + return err; > + } > + > + /* Will take care of sending queued events even if the connection is > + * already confirmed. > + */ > if (nf_conntrack_confirm(skb) != NF_ACCEPT) > return -EINVAL; > > return 0; > } > > -static bool labels_nonzero(const struct ovs_key_ct_labels *labels) > -{ > - size_t i; > - > - for (i = 0; i < sizeof(*labels); i++) > - if (labels->ct_labels[i]) > - return true; > - > - return false; > -} > - > /* Returns 0 on success, -EINPROGRESS if 'skb' is stolen, or other nonzero > * value if 'skb' is freed. > */ > @@ -876,19 +898,7 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb, > err = ovs_ct_commit(net, key, info, skb); > else > err = ovs_ct_lookup(net, key, info, skb); > - if (err) > - goto err; > > - if (info->mark.mask) { > - err = ovs_ct_set_mark(skb, key, info->mark.value, > - info->mark.mask); > - if (err) > - goto err; > - } > - if (labels_nonzero(>labels.mask)) > - err = ovs_ct_set_labels(skb, key, >labels.value, > - >labels.mask); > -err: > skb_push(skb, nh_ofs); > if (err) > kfree_skb(skb); > @@ -1145,6 +1155,20 @@ static int parse_ct(const struct nlattr *attr, struct > ovs_conntrack_info *info, > } > } > > +#ifdef CONFIG_NF_CONNTRACK_MARK > + if (!info->commit && info->mark.mask) { > + OVS_NLERR(log, > + "Setting conntrack mark requires 'commit' flag."); > + return -EINVAL; > + } > +#endif > +#ifdef CONFIG_NF_CONNTRACK_LABELS > + if (!info->commit && labels_nonzero(>labels.mask)) { > + OVS_NLERR(log, > + "Setting conntrack labels requires 'commit' flag."); > + return -EINVAL; > + } > +#endif > if (rem > 0) { > OVS_NLERR(log, "Conntrack attr has %d unknown bytes", rem); > return -EINVAL; > -- > 2.1.4 >
Re: [PATCH net] openvswitch: Fix cached ct with helper.
This would result in inconsistent helper assignment if a first CT action assigns a helper and a further CT action tries to assign a different helper; Typically the second helper assignment would be ignored, but if the unconfirmed conntrack entry is lost due to an upcall the second helper assignment would be successful. This is best resolved by allowing helper assignment by a committing CT action only by testing the 'info->commit' flag in addition to the conditions you have there already. It may also be helpful to fail helper assignment without the commit flag in parse_ct(). Jarno > On May 10, 2016, at 11:40 AM, Joe Stringerwrote: > > When using conntrack helpers from OVS, a common configuration is to > perform a lookup without specifying a helper, then go through a > firewalling policy, only to decide to attach a helper afterwards. > > In this case, the initial lookup will cause a ct entry to be attached to > the skb, then the later commit with helper should attach the helper and > confirm the connection. However, the helper attachment has been missing. > If the user has enabled automatic helper attachment, then this issue > will be masked as it will be applied in init_conntrack(). It is also > masked if the action is executed from ovs_packet_cmd_execute() as that > will construct a fresh skb. > > This patch fixes the issue by making an explicit call to try to assign > the helper if there is a discrepancy between the action's helper and the > current skb->nfct. > > Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action") > Signed-off-by: Joe Stringer > --- > net/openvswitch/conntrack.c | 12 > 1 file changed, 12 insertions(+) > > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c > index b5fea1101faa..89f61a1720eb 100644 > --- a/net/openvswitch/conntrack.c > +++ b/net/openvswitch/conntrack.c > @@ -776,6 +776,18 @@ static int __ovs_ct_lookup(struct net *net, struct > sw_flow_key *key, > return -EINVAL; > } > > + /* Userspace may decide to perform a ct lookup without a helper > + * specified followed by a (recirculate and) commit with one. > + * Therefore, for unconfirmed connections we need to attach the > + * helper here. > + */ > + if (!nf_ct_is_confirmed(ct) && info->helper && !nfct_help(ct)) { > + int err = __nf_ct_try_assign_helper(ct, info->ct, > + GFP_ATOMIC); > + if (err) > + return err; > + } > + > /* Call the helper only if: >* - nf_conntrack_in() was executed above ("!cached") for a >* confirmed connection, or > -- > 2.1.4 >
[PATCH net v3 1/2] udp_tunnel: Remove redundant udp_tunnel_gro_complete().
The setting of the UDP tunnel GSO type is already performed by udp[46]_gro_complete(). Signed-off-by: Jarno Rajahalme <ja...@ovn.org> --- drivers/net/geneve.c | 2 -- drivers/net/vxlan.c | 2 -- include/net/udp_tunnel.h | 9 - net/ipv4/fou.c | 2 -- 4 files changed, 15 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index bc16889..98f1224 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -504,8 +504,6 @@ static int geneve_gro_complete(struct sk_buff *skb, int nhoff, int gh_len; int err = -ENOSYS; - udp_tunnel_gro_complete(skb, nhoff); - gh = (struct genevehdr *)(skb->data + nhoff); gh_len = geneve_hlen(gh); type = gh->proto_type; diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 1c0fa36..dd2d032 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -616,8 +616,6 @@ out: static int vxlan_gro_complete(struct sk_buff *skb, int nhoff, struct udp_offload *uoff) { - udp_tunnel_gro_complete(skb, nhoff); - return eth_gro_complete(skb, nhoff + sizeof(struct vxlanhdr)); } diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h index b831140..a114024 100644 --- a/include/net/udp_tunnel.h +++ b/include/net/udp_tunnel.h @@ -106,15 +106,6 @@ static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb, return iptunnel_handle_offloads(skb, type); } -static inline void udp_tunnel_gro_complete(struct sk_buff *skb, int nhoff) -{ - struct udphdr *uh; - - uh = (struct udphdr *)(skb->data + nhoff - sizeof(struct udphdr)); - skb_shinfo(skb)->gso_type |= uh->check ? - SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL; -} - static inline void udp_tunnel_encap_enable(struct socket *sock) { #if IS_ENABLED(CONFIG_IPV6) diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c index a39068b..305d9ac 100644 --- a/net/ipv4/fou.c +++ b/net/ipv4/fou.c @@ -228,8 +228,6 @@ static int fou_gro_complete(struct sk_buff *skb, int nhoff, int err = -ENOSYS; const struct net_offload **offloads; - udp_tunnel_gro_complete(skb, nhoff); - rcu_read_lock(); offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[proto]); -- 2.7.4