Hi Ben,

Let me try to explain the motivation of this patch. It actually fixes a severe 
design flaw in the native tunnel handling of OVS in the context of 
SDN-controlled cloud systems, which we found and had to fix downstream in our 
NFVI solution.

To implement tunnels (e.g. VXLAN, GRE) in the userspace datapath, OVS needs to 
have the MAC address of the IP nexthop on the underlay network (either the 
remote tunnel end-point in the case of an L2 underlay or a gateway in case of a 
L3 underlay). To this end OVS snoops passing ARP reply messages to build up an 
internal ARP cache. (I don't mention IPv6 ND in this mail, but the equivalent 
holds there as well).

The current ARP snooping logic is broken because OVS simply snoops any ARP 
packet in any bridge after each flow/group table lookup:

static void
do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
                 struct xlate_ctx *ctx, bool is_last_action)
{
    struct flow_wildcards *wc = ctx->wc;
    struct flow *flow = &ctx->xin->flow;
    const struct ofpact *a;

    /* FIXME: This unconditional snooping in all bridges and every flow 
translation is bad. */
    if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
        tnl_neigh_snoop(flow, wc, ctx->xbridge->name);
    }

This happens even in overlay bridges carrying the tenant network traffic. What 
is worse is that OVS stores all IP-MAC address pairs in a single global per 
bridge, not considering VLANs or any other SDN-controlled network separation. 
As IP addresses are generally not unique across tenant networks in NFVI clouds, 
there are frequent collisions and these useless ARP cache entries are 
constantly overwritten.

Every update in the ARP table is considered a switch configuration change and 
triggers the revalidation of all existing datapath flow cache entries, an 
expensive operation. The result is constant high ovs-vswitchd CPU load and even 
data plane disturbances. In the worst case the correct ARP entry for the tunnel 
nexthop can be overwritten, resulting in lost tunnel packets.

This patch addresses the problem by: 

* Moving the tnl_neigh_snoop() call to the only place where ARP snooping makes 
sense: at output to the bridge's LOCAL port, as this is the only supported port 
for tunnel endpoints today.

* Checking that the ARP reply is destined to one of the known IP addresses 
assigned to the bridge's LOCAL port before snooping.

* As the actual tunnel endpoint is typically a LOCAL port of a NORMAL (non-SDN 
controlled) underlay bridge (e.g. br-prv), the VLAN filtering is implicitly 
done by the NORMAL action, so that only packets in the LOCAL port's VLAN are 
seen by the output function.

* On the SDN-controlled bridges that handle tenant networking ("br-int"), the 
LOCAL port does not participate in traffic. No tenant packets will be seen by 
the tunnel neighbor snooping, so that frequent collisions are totally avoided.

I hope that clarifies the big picture. I leave the more technical aspects of 
the data structures to Zoltan.

@Zoltan: Can you include this explanation into the commit message make sure the 
code is commented as well in the critical places?

Regards, Jan

> I don't understand yet the motivation here.  What does it mean "to keep
> ARP neighbor cache clean"?  Is this a bug fix, a performance fix, a
> cleanup, or something else?
> 
> What is the goal of the filtering you mention?  Is that a second change,
> that can and should be a separate patch, or is it closely coupled to the
> first change you mention?
> 
> This gives me the following Clang errors:
> 
>     ../ofproto/ofproto-dpif-xlate.c:3749:13: error: cast from 'const uint8_t 
> *' (aka 'const unsigned char *') to 'const struct in6_addr *'
> increases required alignment from 1 to 4 [-Werror,-Wcast-align]
>     /usr/include/netinet/in.h:454:36: note: expanded from macro 
> 'IN6_ARE_ADDR_EQUAL'
>     ../ofproto/ofproto-dpif-xlate.c:3749:13: error: cast from 'const uint8_t 
> *' (aka 'const unsigned char *') to 'const struct in6_addr *'
> increases required alignment from 1 to 4 [-Werror,-Wcast-align]
>     /usr/include/netinet/in.h:455:36: note: expanded from macro 
> 'IN6_ARE_ADDR_EQUAL'
> 
> The change to tnl-neigh-cache.c that just adds an #include directive
> seems odd to me.
> 
> The treatment of the new xbridge_addr is different from the treatment of
> everything already handled in xlate_ofproto_set().  Why does it need
> special treatment?  Is it RCU safe?  Why does it need a refcount (none
> of the other structs do)?
> 
> Thanks,
> 
> Ben.
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to