On Wed, Jun 24, 2015 at 3:57 PM, Ben Pfaff <b...@nicira.com> wrote:
> On Fri, Jun 19, 2015 at 04:13:24PM -0700, Jesse Gross wrote:
>> The current support for Geneve in OVS is exactly equivalent to VXLAN:
>> it is possible to set and match on the VNI but not on any options
>> contained in the header. This patch enables the use of options.
>>
>> The goal for Geneve support is not to add support for any particular option
>> but to allow end users or controllers to specify what they would like to
>> match. That is, the full range of Geneve's capabilities should be exposed
>> without modifying the code (the one exception being options that require
>> per-packet computation in the fast path).
>>
>> The main issue with supporting Geneve options is how to integrate the
>> fields into the existing OpenFlow pipeline. All existing operations
>> are referred to by their NXM/OXM field name - matches, action generation,
>> arithmetic operations (i.e. tranfer to a register). However, the Geneve
>> option space is exactly the same as the OXM space, so a direct mapping
>> is not feasible. Instead, we create a pool of 64 NXMs that are then
>> dynamically mapped on Geneve option TLVs using OpenFlow. Once mapped,
>> these fields become first-class citizens in the OpenFlow pipeline.
>>
>> An example of how to use Geneve options:
>> ovs-ofctl add-geneve-map br0 {class=0xffff,type=0,len=4}->tun_metadata0
>> ovs-ofctl add-flow br0 
>> in_port=LOCAL,actions=set_field:0xffffffff->tun_metadata0,1
>>
>> This will add a 4 bytes option (filled will all 1's) to all packets
>> coming from the LOCAL port and then send then out to port 1.
>>
>> A limitation of this patch is that although the option table is specified
>> for a particular switch over OpenFlow, it is currently global to all
>> switches. This will be addressed in a future patch.
>
> Out of curiosity, when you say that do you mean "This is left as future
> work" or "I'm cleaning up a patch as we speak" or somewhere in between?

Somewhere in between. Madhu said that he would take a look at it,
although I guess that it is still early. I have a list of things that
I want to at least consider improving but that I deemed not critical
to the initial implementation and pushed off to avoid further
expanding the series, so there will be some more enhancements coming.
I plan to address all of these before 2.5 is released, at a minimum.

> Building on i386, this makes the compiler really angry at me because
> struct tun_metadata is an odd multiple of 32 bits in size, which makes
> struct flow an odd multiple of 32 bits in size and fails the flow.h
> assertion
> BUILD_ASSERT_DECL(sizeof(struct flow) % sizeof(uint64_t) == 0);
>
> This fixes it but obviously it's not the right fix:
>
> diff --git a/lib/tun-metadata.h b/lib/tun-metadata.h
> index 8ba0757..4e1e84a 100644
> --- a/lib/tun-metadata.h
> +++ b/lib/tun-metadata.h
> @@ -37,6 +37,7 @@ struct tun_metadata {
>      uint8_t opts[TUN_METADATA_TOT_OPT_SIZE];
>      uint64_t opt_map;
>      struct tun_table *tab;
> +    uint8_t pad[4];
>  };
>  BUILD_ASSERT_DECL(sizeof(((struct tun_metadata *)0)->opt_map) * 8 >=
>                    TUN_METADATA_NUM_OPTS);

I ended up with this:
    uint8_t pad[sizeof(uint64_t) - sizeof(struct tun_table *)]; /*
Make 8 bytes */

Ideally I would just use __attribute__((aligned(8))) but that's not
very portable.

> I see a lot of use of "1 << (something)" in tun-metadata, but I believe
> that these need to use 1ULL or UINT64_C(1) instead.

Fixed.

> I'm a little confused about the alloc_offset field in struct
> tun_metadata_allocation.  It seems to be a byte offset but
> metadata_loc_from_match() compares it against TUN_METADATA_NUM_OPTS.  Is
> that a typo for TUN_METADATA_TOT_OPT_SIZE?

It's a dumb typo - it should be TUN_METADATA_TOT_OPT_SIZE.

> tun-metadata.c has a few hard tabs.

Fixed.

> Fragmented tunnel metadata is nasty.  Is there a way we can avoid it?

I think (hope) that it is relatively uncommon that it will come up in
practice. However, I don't think that it's possible to avoid
supporting it at all. The alternatives are:
 * Don't support it at all. I don't think this is an acceptable
cop-out because it would mean that controllers have no real idea of
knowing what options they can use. Setting up new options would seem
to non-deterministically fail and the only mechanism that they would
have to fix it is to clear both the flow table and option table and
start over.
 * Compact the option table to remove holes. This would be a little
faster on flow setup because we wouldn't have to traverse the chains
of fragments but I think the code is much nastier. We would have to
update all the flows in the classifier to match the new option table.
It doesn't seem like it is worth it at this point in time.

> I like the implementation approach used in ULONG_FOR_EACH_1 better than
> the one in PRESENT_OPT_FOR_EACH, since the user doesn't have to provide
> a scratch variable.  Also it protects the macro arguments better.
> Actually ULONG_FOR_EACH_1 could be made to work with 32- or 64-bit maps
> just by changing "unsigned long" to "uint64_t" in its definition; maybe
> we should.

I looked at converting ULONG_FOR_EACH_1 but decided against it since
it would introduce an additional branch (to decide whether to use
__builtin_ctz or __builtin_ctz_ll) and it is used in places that are
very careful about that kind thing, like the DPDK datapath. I just
adopted the logic instead in PRESENT_OPT_FOR_EACH.

> I'm not certain whether the tun_metadata layer intends to limit itself
> to option values that are a multiple of 4 bytes (as Geneve has) or
> whether it's supposed to be an abstraction that doesn't care.

In theory it should be relatively generic but in practice, I strongly
suspect that other types of tunnel metadata that we might have in the
future will also be multiples of 4 bytes. If we have something that
has a different requirement, I think we can figure out what we need
then (although I seem to be the king of protocols that push the
boundaries on things like this, so it seems unlikely someone else will
design a protocol that does that...).

> tun_meta_find_key() could probably be a tiny bit faster by switching
> from HMAP_FOR_EACH_WITH_HASH to HMAP_FOR_EACH_IN_BUCKET.

That's true, it's no faster to compare the hash than the actual key.

> I spent some of my time studying the code by annotating it with
> comments.  I'm appending a patch, hope you'll adopt a lot of it.

Thanks a lot - I obviously should have commented this better. I
checked it for correctness (it looks fine to me) and applied the whole
thing with just a few minor tweaks.

> Acked-by: Ben Pfaff <b...@nicira.com>

Thanks a lot for all of the reviews (it turns out userspace has become
a lot more complicated since the last time that I did any serious work
on it...). I'll wait a bit before pushing to look it over and give you
a chance to object to anything I said. I have the patch for Geneve
option support in the userspace datapath ready to go out once this in
to complete the initial needs for OVN.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to