Hi Ben, Let me try to explain the motivation of this patch. It actually fixes a severe design flaw in the native tunnel handling of OVS in the context of SDN-controlled cloud systems, which we found and had to fix downstream in our NFVI solution.
To implement tunnels (e.g. VXLAN, GRE) in the userspace datapath, OVS needs to have the MAC address of the IP nexthop on the underlay network (either the remote tunnel end-point in the case of an L2 underlay or a gateway in case of a L3 underlay). To this end OVS snoops passing ARP reply messages to build up an internal ARP cache. (I don't mention IPv6 ND in this mail, but the equivalent holds there as well). The current ARP snooping logic is broken because OVS simply snoops any ARP packet in any bridge after each flow/group table lookup: static void do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len, struct xlate_ctx *ctx, bool is_last_action) { struct flow_wildcards *wc = ctx->wc; struct flow *flow = &ctx->xin->flow; const struct ofpact *a; /* FIXME: This unconditional snooping in all bridges and every flow translation is bad. */ if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) { tnl_neigh_snoop(flow, wc, ctx->xbridge->name); } This happens even in overlay bridges carrying the tenant network traffic. What is worse is that OVS stores all IP-MAC address pairs in a single global per bridge, not considering VLANs or any other SDN-controlled network separation. As IP addresses are generally not unique across tenant networks in NFVI clouds, there are frequent collisions and these useless ARP cache entries are constantly overwritten. Every update in the ARP table is considered a switch configuration change and triggers the revalidation of all existing datapath flow cache entries, an expensive operation. The result is constant high ovs-vswitchd CPU load and even data plane disturbances. In the worst case the correct ARP entry for the tunnel nexthop can be overwritten, resulting in lost tunnel packets. This patch addresses the problem by: * Moving the tnl_neigh_snoop() call to the only place where ARP snooping makes sense: at output to the bridge's LOCAL port, as this is the only supported port for tunnel endpoints today. * Checking that the ARP reply is destined to one of the known IP addresses assigned to the bridge's LOCAL port before snooping. * As the actual tunnel endpoint is typically a LOCAL port of a NORMAL (non-SDN controlled) underlay bridge (e.g. br-prv), the VLAN filtering is implicitly done by the NORMAL action, so that only packets in the LOCAL port's VLAN are seen by the output function. * On the SDN-controlled bridges that handle tenant networking ("br-int"), the LOCAL port does not participate in traffic. No tenant packets will be seen by the tunnel neighbor snooping, so that frequent collisions are totally avoided. I hope that clarifies the big picture. I leave the more technical aspects of the data structures to Zoltan. @Zoltan: Can you include this explanation into the commit message make sure the code is commented as well in the critical places? Regards, Jan > I don't understand yet the motivation here. What does it mean "to keep > ARP neighbor cache clean"? Is this a bug fix, a performance fix, a > cleanup, or something else? > > What is the goal of the filtering you mention? Is that a second change, > that can and should be a separate patch, or is it closely coupled to the > first change you mention? > > This gives me the following Clang errors: > > ../ofproto/ofproto-dpif-xlate.c:3749:13: error: cast from 'const uint8_t > *' (aka 'const unsigned char *') to 'const struct in6_addr *' > increases required alignment from 1 to 4 [-Werror,-Wcast-align] > /usr/include/netinet/in.h:454:36: note: expanded from macro > 'IN6_ARE_ADDR_EQUAL' > ../ofproto/ofproto-dpif-xlate.c:3749:13: error: cast from 'const uint8_t > *' (aka 'const unsigned char *') to 'const struct in6_addr *' > increases required alignment from 1 to 4 [-Werror,-Wcast-align] > /usr/include/netinet/in.h:455:36: note: expanded from macro > 'IN6_ARE_ADDR_EQUAL' > > The change to tnl-neigh-cache.c that just adds an #include directive > seems odd to me. > > The treatment of the new xbridge_addr is different from the treatment of > everything already handled in xlate_ofproto_set(). Why does it need > special treatment? Is it RCU safe? Why does it need a refcount (none > of the other structs do)? > > Thanks, > > Ben. > _______________________________________________ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev