On Fri, Apr 22, 2016 at 03:56:52PM -0500, Ryan Moats wrote:
> Ben Pfaff <b...@ovn.org> wrote on 04/22/2016 03:06:16 PM:
> 
> > From: Ben Pfaff <b...@ovn.org>
> > To: Ryan Moats/Omaha/IBM@IBMUS
> > Cc: ovs dev <dev@openvswitch.org>
> > Date: 04/22/2016 03:06 PM
> > Subject: Re: [ovs-dev] [RFC] Idea for fixing "raceful" E2E ovn tests
> >
> > On Thu, Apr 14, 2016 at 08:37:27AM -0500, Ryan Moats wrote:
> > > I've pretty much become fed up with the "raceful" nature of the E2E
> > > ovn test cases and so I've set my self the goal of fixing them.
> > >
> > > After some thought last night, I *think* I might have found a way
> > > to do it.  Now, since I'm not 100% that my idea is the cleanest way
> > > to fix things, I thought I'd throw the idea out on the table first
> > > and get some feedback before I go off and write code.
> > >
> > > I'm thinking of the following changes:
> > >
> > > in ovn/controller/ofctrl.c, change ofctrl_put to return a boolean
> > > (true if a flow is queue, false otherwise).
> > >
> > > in ovn/controller/ovn-controller.c:
> > > 1. add a command line argument (--unit-test).
> > > 2. if ofctrl_put return true and the above command line argument
> > >    is specified, dump a line to the log file saying that ovn-controller
> > >    hasn't made any OF changes in the last loop
> > >
> > > in the ovn E2E tests cases:
> > > 1. start ovn-controller with the --unit-test argument
> > > 2. instead of sleep 1 for waiting for ovn-northd/ovn-controller to
> > >    quiesce, look at the tail of the ovn-controller logs for two
> > >    consecutive loops where ovn-controller hasn't made any OF changes
> > >    in the last loop
> >
> > I've had a different idea for how to do this for a while now.  Let me
> > see...
> >
> > You might not be familiar with how ovs-vsctl waits until the changes
> > that it makes to the Open vSwitch configuration take effect before
> > exiting.  It uses a sequence number protocol in the Open_vSwitch
> > database table.  This table (which has exactly one row) has two integer
> > columns, named cur_cfg and next_cfg.  Initially cur_cfg and next_cfg are
> > both zero.  When ovs-vsctl modifies the configuration, it also
> > atomically increments next_cfg.  Before ovs-vswitchd reconfigures
> > itself, it reads next_cfg, then when it finishes it sets cur_cfg to the
> > next_cfg value that it read.  ovs-vsctl waits until cur_cfg is greater
> > than or equal to the next_cfg that it set before it exits.
> >
> > This has a number of nice properties.  It has low overhead in time and
> > space, it works efficiently with any number of writers, and later on it
> > is easy to observe what happened and when by looking at the database
> > log.
> >
> > My thought was to extend this concept to the OVN distributed system.
> > For example:
> >
> >         * Add a next_cfg value somewhere in the OVN_Northbound
> >           database.  When ovn-nbctl or something else updates the
> >           database and wants to allow for finding out where the updates
> >           are propagated, it increments next_cfg.
> >
> >         * Add a similar next_cfg somewhere in OVN_Southbound, and make
> >           ovn-northd propagate the value from northbound to southbound.
> >
> >         * Add a cur_cfg column to the Chassis table in OVN_Southbound.
> >           When an ovn-controller brings its chassis up to date with a
> >           given configuration, it stores the next_cfg value that it just
> >           got up-to-date with into its own cur_cfg.
> >
> >         * In theory ovn-northd could propagate the current minimum value
> >           of cur_cfg back to the northbound database, if that's useful.
> >           In a real system with constant failures though it's hard for
> >           me to guess whether it's realistic.
> >
> > This could even be used such that ovn-nbctl waits for the whole
> > distributed system to catch up before it exits.
> >
> > What are your thoughts?
> 
> This is definitely an interesting idea, but I'm not convinced of the last
> two items.  I agree that while having ovn-northd propagate the current
> minimum value of cur_cfg up is good in theory, the idea of blocking
> ovn-nbctl from exiting until everything catches up worries me.  For
> example,
> what happens if the SB Chassis table has an orphaned entry due to error
> conditions?  Does that lead to the provisioning plane shutting down because
> ovn-nbctl never exits?
> 
> I think I'd rather see ovn-nbctl exit immediately upon making changes and
> we provide an additional commands to ovn-nbctl that either returns a status
> code based on whether the system is up to date or not and/or the values of
> the next_cfg and cur_cfg columns (I assume that ovn-sbctl will show the
> next_cfg and cur_cfg for chassis if that table is shown). That way we can
> continue making progress in the case of corner failure cases and have a
> way to identify which row(s) in the chassis table form the source of a
> problem.
> 
> Thus (to close the loop), the test case would go ahead and make the needed
> changes with ovn-nbctl and then have a gate where it waits until ovn-nbctl
> reports that the system is up to date, at which point the test packets can
> be sent.

The default for ovs-vsctl is to wait, and one can override it with
--no-wait.  It would be fine with me for the default for ovn-nbctl to be
not to wait and to provide a --wait option if one does want to wait.  I
agree that a large, real system is going to have failures that mean that
it may be impractical to wait by default, but I also think that the
ability to wait is going to be valuable for unit tests especially.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to