On Tue, Mar 12, 2013 at 07:44:10PM -0700, Ethan Jackson wrote:
> Traditionally, Open vSwitch has used a variant of 802.1ag "CFM" for
> interface liveness detection. This has served us well until now,
> but has several serious drawbacks which have steadily become more
> inconvenient. First, the 802.1ag standard does not implement
> several useful features forcing us to (optionally) break
> compatibility. Second, 802.1.ag is not particularly popular
> outside of carrier grade networking equipment. Third, 802.1ag is
> simply quite awkward.
>
> In an effort to solve the aforementioned problems, this patch
> implements BFD which is ubiquitous, well designed, straight
> forward, and implements required features in a standard way. The
> initial cut of the protocol focuses on getting the basics of the
> specification correct, leaving performance optimizations, and
> advanced features as future work. The protocol should be
> considered experimental pending future testing.
>
> Signed-off-by: Ethan Jackson <[email protected]>
bfd.c
-----
> +#include <config.h>
> +#include <bfd.h>
We usually use double quotes ("bfd.h") for our own headers.
> +/* XXX Finish Experimnetal BFD support.
What part is experimental? (The protocol or the implementation?)
> + * - CFM random VLAN option equivalent. Instead of vlan, randomly modulate
> the
> + * UDP source port to get the required entropy.
Ugh. Do we really need random anything? I've never liked that
"feature".
> +#define BFD_VERSION 1
I think the following is microseconds, could you add a comment /*
microseconds */, put _USEC in the name, etc.?
> +#define OVS_BFD_MIN_RX 200000
I'd be half-inclined to write the states as 0 << 6, 1 << 6, 2 << 6, 3
<< 6. But it doesn't really matter:
> +enum state {
> + STATE_ADMIN_DOWN = 0,
> + STATE_DOWN = 1,
> + STATE_INIT = 2,
> + STATE_UP = 3
> +};
...
> +/* RFC 5880 Section 4.1
I think you left out the line that gives the first digits of the
two-digit numbers here. It's harder to read without them:
> + * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + * |Vers | Diag |Sta|P|F|C|A|D|M| Detect Mult | Length |
> + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + * | My Discriminator |
> + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + * | Your Discriminator |
> + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + * | Desired Min TX Interval |
> + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + * | Required Min RX Interval |
> + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + * | Required Min Echo RX Interval |
> + * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ */
The BFD timers are 32-bit counts of microseconds. The timers in
struct bfd are 64-bit and appear to be milliseconds, but I'm not
certain that all of them have the same unit. Comments on units would
be welcome. Do they need to be 64-bit?
...
> +/* Returns a 'smap' of key value pairs representing the status of 'bfd'
> + * intended for the OVS database. */
> +void
> +bfd_get_status(const struct bfd *bfd, struct smap *smap)
> +{
> + smap_add(smap, "forwarding", bfd_forwarding(bfd) ? "true" : "false");
> + smap_add(smap, "state", bfd_state_str(bfd->state));
> + smap_add(smap, "diagnostic", bfd_diag_str(bfd->diag));
> +
> + if (bfd->state != STATE_DOWN) {
Spaces are OK in the OVSDB protocol but I don't think we use them
much. Usually we use hyphen or underscores (probably should
standardize that in fact):
> + smap_add(smap, "remote state", bfd_state_str(bfd->rmt_state));
> + smap_add(smap, "remote diagnostic", bfd_diag_str(bfd->rmt_diag));
> + }
> +}
The interface for bfd_configure() is unusual. It does seem to be
appropriate.
bfd_configure() checks for a key "enable" in the configuration that is
passed in, but I don't see any documentation for that key.
This might be a personal problem, but I find the following:
> + udp_src++;
> + udp_src = udp_src < 49152 ? 49152 : udp_src;
> + bfd->udp_src = udp_src;
a little harder to verify as correct than:
bfd->udp_src = (udp_src++ & 16383) + 49152;
The following, and similar, bugs me a bit because it can call
smap_get_int() twice due to the macro expansion:
> + min_tx = MAX(smap_get_int(cfg, "min_tx", 100), 100);
> +void
> +bfd_run(struct bfd *bfd)
> +{
> + if (bfd->state > STATE_DOWN && time_msec() > bfd->detect_time) {
The test above should probably be >= bfd->detect_time or we should
wait until detect_time + 1 in bfd_wait(), otherwise we could wake up 1
ms too early.
> + bfd_set_state(bfd, STATE_DOWN, DIAG_EXPIRED);
> + }
> +
> + if (bfd->min_tx != bfd->cfg_min_tx || bfd->min_rx != bfd->cfg_min_rx) {
> + bfd_poll(bfd);
> + }
> +}
...
In bfd_put_packet(), it might be wise to start out with
ofpbuf_reserve(p, 2); to properly align what comes afterward.
> + eth = ofpbuf_put_uninit(p, sizeof *eth);
> + memcpy(eth->eth_dst, eth_addr_broadcast, sizeof eth->eth_dst);
> + memcpy(eth->eth_src, eth_src, ETH_ADDR_LEN);
It would be nice to use sizeof in both cases or neither, above.
> + eth->eth_type = htons(ETH_TYPE_IP);
> +
> + ip = ofpbuf_put_zeros(p, sizeof *ip);
> + ip->ip_ihl_ver = IP_IHL_VER(5, 4);
> + ip->ip_tot_len = htons(sizeof *ip + sizeof *udp + sizeof(struct msg));
> + ip->ip_ttl = 1;
I understand why we're using ip_ttl == 1, but I wonder whether you
considered following RFC 5082 and therefore using (and checking for)
ip_ttl == 255.
> + ip->ip_proto = IPPROTO_UDP;
> + ip->ip_src = htonl(0xA9FE0100); /* 169.254.1.0 Link Local. */
> + ip->ip_dst = htonl(0xA9FE0101); /* 169.254.1.1 Link Local. */
> + ip->ip_csum = csum(ip, sizeof *ip);
> +
> + udp = ofpbuf_put_zeros(p, sizeof *udp);
> + udp->udp_src = htons(bfd->udp_src);
> + udp->udp_dst = htons(BFD_DEST_PORT);
> + udp->udp_len = htons(sizeof *udp + sizeof(struct msg));
I'd be inclined to write "sizeof *msg" instead of "sizeof(struct
msg)" above.
> + msg = ofpbuf_put_uninit(p, sizeof(struct msg));
Here too.
It looks like bfd_process_packet() expects the caller to have pulled
off everything except the L7 payload. It would be nice to put this in
a comment.
I don't see a check for this requirement from 6.8.6 (I think it only
requires comparing p->size to msg->length):
If the Length field is greater than the payload of the
encapsulating protocol, the packet MUST be discarded.
The pseudocode at the end of 6.8.6 only mentions updating
bfd.LocalDiag in a couple of cases, but the implementation appears to
update LocalDiag in every case where it changes the session state.
I don't see anything in bfd_process_packet() that corresponds to:
If the packet was not discarded, it has been received for purposes
of the Detection Time expiration rules in section 6.8.4.
but maybe I just missed it.
Perhaps the log message in bfd_poll() should include bfd->name.
In bfd_min_tx(), I'd be inclined to remove the outer () here:
> + return (bfd->state == STATE_UP ? bfd->min_tx : MAX(bfd->min_tx, 1000));
In bfd_tx_interval(), I'm not wild about putting MAX around a function
call here either.
In bfd_flag_str(), please use ovs_strlcpy() instead of strncpy().
There is occasionally a good reason to use strncpy(). However:
* Using strncpy() into a large buffer can be very inefficient.
strncpy() always writes to every byte in the destination
buffer, which can waste a lot of time if the destination
buffer is much longer than the source string.
* If the source string is longer than the size of the
destination buffer, then strncpy() doesn't write a
terminating null. So a call to strncpy() must be followed
by explicitly writing a null terminator at the end of the
destination buffer in most cases.
bfd_flag_str() could return a const string.
> +static void
> +log_msg(enum vlog_level level, const struct msg *p, const char *message,
> + const struct bfd *bfd)
> +{
> + struct ds ds = DS_EMPTY_INITIALIZER;
> +
> + if (level != VLL_DBG && vlog_should_drop(THIS_MODULE, level, &rl)) {
If we really don't want to rate-limit debug messages, then we should
at least check for VLOG_IS_DBG_ENABLED() in that case, so that in the
common case we don't spend a lot of time composing a message that will
just be discarded later.
It might be a good idea to rate-limit debug messages anyway, because
these days we have an ovs-appctl command to disable rate limits on
particular modules.
Would there be any value in showing the full times in microseconds?
Occasionally I find that every detail is valuable in debugging.
The implementation of generate_discriminator() could be expensive if
we have thousands of BFD instances (which I know we will have). Would
it be worthwhile to use a global discriminator-based hmap instead of a
global list to hold the BFD instances?
> +static long long int
> +msec_diff(long long int earlier, long long int later) {
We usually put the { on a line by itself.
> + later -= earlier;
> + return later < 0 ? 0 : later;
> +}
The following loses a bit of information by truncating to 0. Did you
consider a format like "now%+lld ms" so that you get output like
"now+5 ms" or "now-10 ms"?
> + ds_put_format(ds, "\tDetect Time: %lldms from now\n",
> + msec_diff(time_msec(), bfd->detect_time));
> + ds_put_format(ds, "\tNext TX Time: %lldms from now\n",
> + msec_diff(time_msec(), bfd->next_tx));
> + ds_put_format(ds, "\tLast TX Time: %lldms ago\n",
> + msec_diff(bfd->last_tx, time_msec()));
bfd.h
-----
I think that the parameter to bfd_should_send_packet() could be 'const':
> +bool bfd_should_send_packet(struct bfd *);
odp-util.h
----------
I think this comment can now add SLOW_BFD:
/* Mutually exclusive with SLOW_CFM, SLOW_LACP, SLOW_STP.
* Could possibly appear with SLOW_IN_BAND. */
ofproto-dpif.c
--------------
Why do we now need to call time_refresh() in run_fast()?
vswitch.xml
-----------
Should we mention RFC 5881 also?
"three times the expected reception rate" is the way we configure BFD,
I guess, but it's not protocol-specified.
"aims comply" => "aims to comply":
The Open vSwitch implementation of BFD aims comply faithfully with
Many of the column specifications could be more specific. For
example:
<column name="bfd" key="min_rx"
type='{"type": "integer", "minInteger": 1}'>
and
<column name="bfd_status" key="state">
type='{"type": "string",
"enum": ["set", ["UP", "DOWN", "INIT", "ADMIN_DOWN"]]}'>
We usually use quotes ``like this'' in vswitch.xml, here:
> + omission is "Demand Mode", which we hope to include in future.
> + Open vSwitch does not implement the optional Authentication or
and here:
> + "Echo Mode" features.
> + </p>
> + <column name="bfd" key="min_rx">
I might say "fastest" instead of "minimum" here:
> + The minimum rate, in milliseconds, at which this BFD session is
may "be" slower:
> + willing to receive BFD control messages. The actual rate may
> slower
> + if the remote endpoint isn't willing to transmit as quickly as
> + specified. Defaults to <code>1000</code>.
> + </column>
> +
> + <column name="bfd" key="min_tx">
"fastest" here?
> + The minimum rate, in milliseconds, at which this BFD session is
> + willing to transmit BFD control messages. The actual rate may be
> + slower if the remote endpoint isn't willing to receive as quickly
> as
> + specified. Defaults to <code>100</code>.
> + </column>
> +
> + <column name="bfd" key="cpath_down">
The following description doesn't actually tell the user how to use
cpath_down. Rephrase slightly?
> + Concatenated path down may be used when the local system should not
> + have traffic forwarded to it for some reason other than a
> connectivty
> + failure on the interface being monitored. The local BFD session
> will
> + notify the remote session of the connectivity problem, at which
> time
> + the remote session may choose not to forward traffic. Defaults to
> + <code>false</code>.
> + </column>
> +
> + <column name="bfd_status" key="state">
Any reason to make the state values all-uppercase? We tend to use
lowercase elsewhere in the db.
> + State of the BFD session. One of <code>ADMIN_DOWN</code>,
> + <code>DOWN</code>, <code>INIT</code>, or <code>UP</code>.
> + </column>
> +
> + <column name="bfd_status" key="forwarding">
> + True if the BFD session believes this <ref table="Interface"/> may
> be
> + used to forward traffic. Typically this means the local session is
"signaling":
> + up, and the remote system isn't signalling a problem such as
> + concatenated path down.
> + </column>
Thanks,
Ben.
_______________________________________________
dev mailing list
[email protected]
http://openvswitch.org/mailman/listinfo/dev