On Fri, Apr 22, 2016 at 03:56:52PM -0500, Ryan Moats wrote: > Ben Pfaff <[email protected]> wrote on 04/22/2016 03:06:16 PM: > > > From: Ben Pfaff <[email protected]> > > To: Ryan Moats/Omaha/IBM@IBMUS > > Cc: ovs dev <[email protected]> > > Date: 04/22/2016 03:06 PM > > Subject: Re: [ovs-dev] [RFC] Idea for fixing "raceful" E2E ovn tests > > > > On Thu, Apr 14, 2016 at 08:37:27AM -0500, Ryan Moats wrote: > > > I've pretty much become fed up with the "raceful" nature of the E2E > > > ovn test cases and so I've set my self the goal of fixing them. > > > > > > After some thought last night, I *think* I might have found a way > > > to do it. Now, since I'm not 100% that my idea is the cleanest way > > > to fix things, I thought I'd throw the idea out on the table first > > > and get some feedback before I go off and write code. > > > > > > I'm thinking of the following changes: > > > > > > in ovn/controller/ofctrl.c, change ofctrl_put to return a boolean > > > (true if a flow is queue, false otherwise). > > > > > > in ovn/controller/ovn-controller.c: > > > 1. add a command line argument (--unit-test). > > > 2. if ofctrl_put return true and the above command line argument > > > is specified, dump a line to the log file saying that ovn-controller > > > hasn't made any OF changes in the last loop > > > > > > in the ovn E2E tests cases: > > > 1. start ovn-controller with the --unit-test argument > > > 2. instead of sleep 1 for waiting for ovn-northd/ovn-controller to > > > quiesce, look at the tail of the ovn-controller logs for two > > > consecutive loops where ovn-controller hasn't made any OF changes > > > in the last loop > > > > I've had a different idea for how to do this for a while now. Let me > > see... > > > > You might not be familiar with how ovs-vsctl waits until the changes > > that it makes to the Open vSwitch configuration take effect before > > exiting. It uses a sequence number protocol in the Open_vSwitch > > database table. This table (which has exactly one row) has two integer > > columns, named cur_cfg and next_cfg. Initially cur_cfg and next_cfg are > > both zero. When ovs-vsctl modifies the configuration, it also > > atomically increments next_cfg. Before ovs-vswitchd reconfigures > > itself, it reads next_cfg, then when it finishes it sets cur_cfg to the > > next_cfg value that it read. ovs-vsctl waits until cur_cfg is greater > > than or equal to the next_cfg that it set before it exits. > > > > This has a number of nice properties. It has low overhead in time and > > space, it works efficiently with any number of writers, and later on it > > is easy to observe what happened and when by looking at the database > > log. > > > > My thought was to extend this concept to the OVN distributed system. > > For example: > > > > * Add a next_cfg value somewhere in the OVN_Northbound > > database. When ovn-nbctl or something else updates the > > database and wants to allow for finding out where the updates > > are propagated, it increments next_cfg. > > > > * Add a similar next_cfg somewhere in OVN_Southbound, and make > > ovn-northd propagate the value from northbound to southbound. > > > > * Add a cur_cfg column to the Chassis table in OVN_Southbound. > > When an ovn-controller brings its chassis up to date with a > > given configuration, it stores the next_cfg value that it just > > got up-to-date with into its own cur_cfg. > > > > * In theory ovn-northd could propagate the current minimum value > > of cur_cfg back to the northbound database, if that's useful. > > In a real system with constant failures though it's hard for > > me to guess whether it's realistic. > > > > This could even be used such that ovn-nbctl waits for the whole > > distributed system to catch up before it exits. > > > > What are your thoughts? > > This is definitely an interesting idea, but I'm not convinced of the last > two items. I agree that while having ovn-northd propagate the current > minimum value of cur_cfg up is good in theory, the idea of blocking > ovn-nbctl from exiting until everything catches up worries me. For > example, > what happens if the SB Chassis table has an orphaned entry due to error > conditions? Does that lead to the provisioning plane shutting down because > ovn-nbctl never exits? > > I think I'd rather see ovn-nbctl exit immediately upon making changes and > we provide an additional commands to ovn-nbctl that either returns a status > code based on whether the system is up to date or not and/or the values of > the next_cfg and cur_cfg columns (I assume that ovn-sbctl will show the > next_cfg and cur_cfg for chassis if that table is shown). That way we can > continue making progress in the case of corner failure cases and have a > way to identify which row(s) in the chassis table form the source of a > problem. > > Thus (to close the loop), the test case would go ahead and make the needed > changes with ovn-nbctl and then have a gate where it waits until ovn-nbctl > reports that the system is up to date, at which point the test packets can > be sent.
The default for ovs-vsctl is to wait, and one can override it with --no-wait. It would be fine with me for the default for ovn-nbctl to be not to wait and to provide a --wait option if one does want to wait. I agree that a large, real system is going to have failures that mean that it may be impractical to wait by default, but I also think that the ability to wait is going to be valuable for unit tests especially. _______________________________________________ dev mailing list [email protected] http://openvswitch.org/mailman/listinfo/dev
